1 Summary

The work develops two models for predicting the sale price of houses in the Ames Housing Dataset:

1.1 Primary variables of the multilinear model

The most important variables for the multilinear model are GrLivArea, Functional, Condition1, MSZoning, and SaleCondition.

GrLivArea: Above grade (ground) living area square feet

Relative importance: 100%
Comment: The log of sale price varies linearly with the log of living area.
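The log–log relationship noted above can be illustrated with a small simulation (illustrative only: the exponent, sample size, and noise level below are invented, not taken from the fitted Ames model):

```r
# Simulated log-log relationship between living area and price.  The
# slope of the fitted line estimates the price elasticity with respect
# to area and should recover the true exponent used in the simulation.
set.seed(42)
area <- exp(runif(500, log(500), log(4000)))        # square feet
price <- 2000 * area^0.8 * exp(rnorm(500, sd = 0.1))
fit <- lm(log(price) ~ log(area))
elasticity <- unname(coef(fit)['log(area)'])        # close to 0.8
```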


Functional: Home functionality (Assume typical unless deductions are warranted)

       Typ      Typical Functionality
       Min1     Minor Deductions 1
       Min2     Minor Deductions 2
       Mod      Moderate Deductions
       Maj1     Major Deductions 1
       Maj2     Major Deductions 2
       Sev      Severely Damaged
       Sal      Salvage only

Relative importance: 71.9%
Comment: Any rating below typical functionality is associated with a lower
    price, but the data do not show a clear dependence of price on the
    severity of the functionality loss.


Condition1: Proximity to various conditions

       Artery   Adjacent to arterial street
       Feedr    Adjacent to feeder street
       Norm     Normal
       RRNn     Within 200' of North-South Railroad
       RRAn     Adjacent to North-South Railroad
       PosN     Near positive off-site feature--park, greenbelt, etc.
       PosA     Adjacent to positive off-site feature
       RRNe     Within 200' of East-West Railroad
       RRAe     Adjacent to East-West Railroad

Relative importance: 51.9%
Comment: Being near or adjacent to a park, greenbelt, or similar feature
    boosts the price, while adjacency to a feeder or arterial street
    lowers it.  Further research is needed to clarify how proximity
    to railroads affects price.


MSZoning: Identifies the general zoning classification of the sale.

       A        Agriculture
       C        Commercial
       FV       Floating Village Residential
       I        Industrial
       RH       Residential High Density
       RL       Residential Low Density
       RP       Residential Low Density Park
       RM       Residential Medium Density

Relative importance: 38%
Comment: Most houses are in a low-density residential zone.  Houses in a
    "Floating Village" residential zone get a boost in price, while
    houses in medium- or high-density residential zones get a price
    penalty.  Houses in commercial zones get a substantial price penalty.


SaleCondition: Condition of sale

       Normal   Normal Sale
       Abnorml  Abnormal Sale -  trade, foreclosure, short sale
       AdjLand  Adjoining Land Purchase
       Alloca   Allocation - two linked properties with separate deeds,
                typically condo with a garage unit
       Family   Sale between family members
       Partial  Home was not completed when last assessed
                (associated with New Homes)

Relative importance: 36.8%
Comment: Houses with an abnormal sale take a price penalty, while new homes
    in the Partial category have a price boost.

2 Exploratory data analysis / feature engineering

Feature engineering was done in an iterative process.

2.1 Missing data

Summary

After preprocessing that repairs some inconsistencies in the data, display the count of missing values for individual variables.

Discussion

From the variable descriptions provided in data_description.txt, “NA” is a valid category for certain columns.

Consistency checks revealed that “NA” sometimes indicates missing data, even in variables where it represents a valid category.

  • Example: The description of variable BsmtExposure has “NA” indicating “No Basement”. But the house with Id 949 has “NA” for BsmtExposure even though BsmtQual is “Good”, BsmtCond is “Typical”, and TotalBsmtSF (“Total square feet of basement area”) is 936. For this house, the “NA” for BsmtExposure likely indicates a missing value rather than “No Basement.”

The script preprocess.R attempts to repair inconsistencies and replace “NA” by a string such as “NB” for “No Basement” in cases where it represented a valid category.

  • In the preprocessed data, only the variable LotFrontage (“Linear feet of street connected to property”) has a significant number of missing values.
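The script preprocess.R itself is not shown here; the following is a minimal sketch of the kind of repair it performs. The helper name repair_basement_na and the exact rule are illustrative assumptions, not the actual script:

```r
# Illustrative sketch (not the actual preprocess.R): recode NA to 'NB'
# for basement variables only when TotalBsmtSF confirms that no basement
# exists; otherwise leave NA in place as a genuine missing value.
repair_basement_na <- function(data) {
    bsmt_vars <- intersect(
        c('BsmtQual', 'BsmtCond', 'BsmtExposure',
          'BsmtFinType1', 'BsmtFinType2'),
        colnames(data)
    )
    no_basement <- !is.na(data$TotalBsmtSF) & data$TotalBsmtSF == 0
    for (v in bsmt_vars) {
        data[[v]][is.na(data[[v]]) & no_basement] <- 'NB'
    }
    return(data)
}
```

Under this rule, house 949 above (TotalBsmtSF of 936) would keep its “NA” as a missing value rather than having it recoded to “NB”.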

Missing value counts

read_data <- function(file) {
    data <- read.csv(file, stringsAsFactors = FALSE)
    return(data)
}
train_data <- read_data('data/train.csv')
test_data <- read_data('data/test.csv')
get_na_counts <- function(data, dataset_name) {
    na_counts <- data.frame(name = colnames(data),
                            count = colSums(is.na(data)),
                            dataset = dataset_name,
                            stringsAsFactors = FALSE) %>%
        arrange(desc(count)) %>%
        filter(count != 0)
    return(na_counts)
}

na_counts_train <- get_na_counts(train_data, 'train')
na_counts_test <- get_na_counts(test_data, 'test')
combined_na_counts <- rbind(na_counts_test, na_counts_train,
                            make.row.names = FALSE) %>%
    arrange(desc(count), desc(dataset), name) %>%
    rename(`missing value count` = count)

# The same (complicated) set of options can be used with DT::datatable
# throughout this file.  A function wrapper is defined for this.
get_datatable <- function(data, page_length = 20) {
    table_alignment <- list(className = 'dt-center', targets = '_all')
    table_with_options <- datatable(
        data,
        rownames = FALSE,
        options = list(
            columnDefs = list(table_alignment),
            pageLength = page_length
         )
    )
    return(table_with_options)
}
get_datatable(combined_na_counts)

2.2 Variable distributions

Summary

Apply transformations and plot the distribution of each variable.

Discussion

The distribution of each variable is plotted.

  • Numeric predictors with only a few values are labeled as discrete, while those with many values are labeled as continuous.
  • For variables that were transformed, the distributions before and after the transformation are both shown, along with the formula for the transformation.
  • The variable descriptions from data_description.txt are included with the plots.

Transforming the target variable and the predictors GrLivArea and LotArea decreased skew.
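The skew reduction can be checked with a sample skewness function; the snippet below uses simulated right-skewed data as a stand-in for the actual SalePrice column:

```r
# Sample skewness (third standardized moment): right-skewed data give a
# positive value, and a log transform should pull it toward zero.
skewness <- function(x) {
    x <- x[!is.na(x)]
    mean((x - mean(x))^3) / sd(x)^3
}
set.seed(1)
price <- exp(rnorm(1000, mean = 12, sd = 0.4))  # lognormal: right-skewed
skew_raw <- skewness(price)        # clearly positive
skew_log <- skewness(log(price))   # near zero
```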

Several variables have a distribution concentrated into a single value or a small subset of the values.

  • For example, most houses have zero pool area.

Target

to_plot <- select(train_data, SalePrice) %>%
    mutate(SalePrice = SalePrice / 1000)
fig <- ggplot(to_plot,
              aes(x = SalePrice)) +
    geom_histogram(fill = 'navyblue', bins = 30) +
    scale_x_continuous(name = 'SalePrice',
                       labels = label_dollar(suffix = 'k')) +
    ggtitle('SalePrice')
print(fig)

to_plot <- mutate(to_plot, SalePrice = log(SalePrice))
fig <- ggplot(select(to_plot, SalePrice),
              aes(x = SalePrice)) +
    geom_histogram(fill = 'navyblue', bins = 30) +
    scale_x_continuous(name = 'log(SalePrice)') +
    ggtitle('log(SalePrice)')
print(fig)
cat('transformation:  price -> log(price)\n\n')

transformation:  price -> log(price)

Discrete predictors

# Given a vector of variable names, reorder the names to match the order used in
# the original data set (i.e., in csv files and data_description.txt).
all_column_names <- colnames(train_data)
reorder_names <- function(variable_names) {
    reorder_index <- order(match(variable_names, all_column_names))
    variable_names <- variable_names[reorder_index]
    return(variable_names)
}

# For numeric variables with only a few values (such as YrSold), it is not
# initially clear whether the variable should be treated as categorical or
# numeric in, say, a linear model.  For plots, however, it is natural to treat
# these variables as categorical initially.
#
# Define three classes of variables:  categorical, continuous, and discrete
# numeric.  Here 'continuous' is used for numeric variables containing many
# values.

# Note that MSSubClass is a nominal variable labeled with nonconsecutive integer
# values, and it should be treated as categorical.
train_data <- mutate(train_data, MSSubClass = as.factor(MSSubClass))
categorical_predictors <- reorder_names(colnames(
    select(train_data, where(is.character), MSSubClass)
))

discrete_numeric <- reorder_names(c(
    'OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
    'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
    'GarageCars', 'MoSold', 'YrSold'
))

continuous_predictors <- reorder_names(setdiff(
    all_column_names,
    c(categorical_predictors, discrete_numeric, 'Id', 'SalePrice')
))

# Helper function that finds the total number of characters needed for the
# labels on the horizontal axis.  If this number is too large, the labels are
# rotated.
nchar_labels <- function(vector_to_plot) {
    # Drop missing values.
    vector_to_plot <- vector_to_plot[!is.na(vector_to_plot)]
    char_vector <- as.character(vector_to_plot)
    label_lengths <- nchar(unique(char_vector))
    return(sum(label_lengths))
}

# Helper function that rotates the labels of the horizontal axis if necessary.
set_label_angle <- function(fig, vector_to_plot) {
    if (nchar_labels(vector_to_plot) > 60) {
        fig <- fig + theme(
            axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
        )
    }
    return(fig)
}

# Helper function used in setting x-axis labels.
get_discrete_values <- function(vector_to_plot) {
    axis_labels <- seq(min(vector_to_plot),
                       max(vector_to_plot))
    return(axis_labels)
}

to_plot_categorical <- reorder_names(
    c(categorical_predictors, discrete_numeric)
)
for (variable_name in to_plot_categorical) {
    # The list variable_descriptions is defined in 'descriptions.R'
    description <- variable_descriptions[[variable_name]]
    to_plot <- select(train_data, all_of(variable_name)) %>%
        drop_na()

    # In cases where the variable is an integer, converting it to a factor
    # before plotting reduces the amount of tweaking the plot needs.
    # However, values absent from the data would then be dropped from the
    # horizontal axis.  In the training set, for instance, no house has 7
    # bedrooms above ground, so 6 would end up next to 8 without special
    # handling.  To avoid this, use the limits argument of scale_x_discrete.
    is_int <- is.integer(to_plot[[variable_name]])
    if (is_int) {
        x_axis_labels <- get_discrete_values(to_plot[[variable_name]]) %>%
            as.character
        to_plot[[variable_name]] <- as.factor(to_plot[[variable_name]])
    }

    fig <- ggplot(to_plot, aes_string(x = variable_name)) +
        geom_bar(fill = 'navyblue') +
        ggtitle(variable_name)
    if (is_int) {
        fig <- fig + scale_x_discrete(limits = x_axis_labels)
    } else {
        fig <- set_label_angle(fig, to_plot[[variable_name]])
    }
    print(fig)
    cat(description)
}

MSSubClass: Identifies the type of dwelling involved in the sale.

        20      1-STORY 1946 & NEWER ALL STYLES
        30      1-STORY 1945 & OLDER
        40      1-STORY W/FINISHED ATTIC ALL AGES
        45      1-1/2 STORY - UNFINISHED ALL AGES
        50      1-1/2 STORY FINISHED ALL AGES
        60      2-STORY 1946 & NEWER
        70      2-STORY 1945 & OLDER
        75      2-1/2 STORY ALL AGES
        80      SPLIT OR MULTI-LEVEL
        85      SPLIT FOYER
        90      DUPLEX - ALL STYLES AND AGES
       120      1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150      1-1/2 STORY PUD - ALL AGES
       160      2-STORY PUD - 1946 & NEWER
       180      PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190      2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

       A        Agriculture
       C        Commercial
       FV       Floating Village Residential
       I        Industrial
       RH       Residential High Density
       RL       Residential Low Density
       RP       Residential Low Density Park
       RM       Residential Medium Density

Street: Type of road access to property

       Grvl     Gravel
       Pave     Paved

Alley: Type of alley access to property

       Grvl     Gravel
       Pave     Paved
       None     No alley access

LotShape: General shape of property

       Reg      Regular
       IR1      Slightly irregular
       IR2      Moderately Irregular
       IR3      Irregular

LandContour: Flatness of the property

       Lvl      Near Flat/Level
       Bnk      Banked - Quick and significant rise from street grade to building
       HLS      Hillside - Significant slope from side to side
       Low      Depression

Utilities: Type of utilities available

       AllPub   All public Utilities (E,G,W,& S)
       NoSewr   Electricity, Gas, and Water (Septic Tank)
       NoSeWa   Electricity and Gas Only
       ELO      Electricity only

LotConfig: Lot configuration

       Inside   Inside lot
       Corner   Corner lot
       CulDSac  Cul-de-sac
       FR2      Frontage on 2 sides of property
       FR3      Frontage on 3 sides of property

LandSlope: Slope of property

       Gtl      Gentle slope
       Mod      Moderate Slope
       Sev      Severe Slope

Neighborhood: Physical locations within Ames city limits

       Blmngtn  Bloomington Heights
       Blueste  Bluestem
       BrDale   Briardale
       BrkSide  Brookside
       ClearCr  Clear Creek
       CollgCr  College Creek
       Crawfor  Crawford
       Edwards  Edwards
       Gilbert  Gilbert
       IDOTRR   Iowa DOT and Rail Road
       MeadowV  Meadow Village
       Mitchel  Mitchell
       Names    North Ames
       NoRidge  Northridge
       NPkVill  Northpark Villa
       NridgHt  Northridge Heights
       NWAmes   Northwest Ames
       OldTown  Old Town
       SWISU    South & West of Iowa State University
       Sawyer   Sawyer
       SawyerW  Sawyer West
       Somerst  Somerset
       StoneBr  Stone Brook
       Timber   Timberland
       Veenker  Veenker

Condition1: Proximity to various conditions

       Artery   Adjacent to arterial street
       Feedr    Adjacent to feeder street
       Norm     Normal
       RRNn     Within 200' of North-South Railroad
       RRAn     Adjacent to North-South Railroad
       PosN     Near positive off-site feature--park, greenbelt, etc.
       PosA     Adjacent to positive off-site feature
       RRNe     Within 200' of East-West Railroad
       RRAe     Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present)

       Artery   Adjacent to arterial street
       Feedr    Adjacent to feeder street
       Norm     Normal
       RRNn     Within 200' of North-South Railroad
       RRAn     Adjacent to North-South Railroad
       PosN     Near positive off-site feature--park, greenbelt, etc.
       PosA     Adjacent to positive off-site feature
       RRNe     Within 200' of East-West Railroad
       RRAe     Adjacent to East-West Railroad

BldgType: Type of dwelling

       1Fam     Single-family Detached
       2FmCon   Two-family Conversion; originally built as one-family dwelling
       Duplx    Duplex
       TwnhsE   Townhouse End Unit
       TwnhsI   Townhouse Inside Unit

HouseStyle: Style of dwelling

       1Story   One story
       1.5Fin   One and one-half story: 2nd level finished
       1.5Unf   One and one-half story: 2nd level unfinished
       2Story   Two story
       2.5Fin   Two and one-half story: 2nd level finished
       2.5Unf   Two and one-half story: 2nd level unfinished
       SFoyer   Split Foyer
       SLvl     Split Level

OverallQual: Rates the overall material and finish of the house

       10       Very Excellent
       9        Excellent
       8        Very Good
       7        Good
       6        Above Average
       5        Average
       4        Below Average
       3        Fair
       2        Poor
       1        Very Poor

OverallCond: Rates the overall condition of the house

       10       Very Excellent
       9        Excellent
       8        Very Good
       7        Good
       6        Above Average
       5        Average
       4        Below Average
       3        Fair
       2        Poor
       1        Very Poor

RoofStyle: Type of roof

       Flat     Flat
       Gable    Gable
       Gambrel  Gambrel (Barn)
       Hip      Hip
       Mansard  Mansard
       Shed     Shed

RoofMatl: Roof material

       ClyTile  Clay or Tile
       CompShg  Standard (Composite) Shingle
       Membran  Membrane
       Metal    Metal
       Roll     Roll
       Tar&Grv  Gravel & Tar
       WdShake  Wood Shakes
       WdShngl  Wood Shingles

Exterior1st: Exterior covering on house

       AsbShng  Asbestos Shingles
       AsphShn  Asphalt Shingles
       BrkComm  Brick Common
       BrkFace  Brick Face
       CBlock   Cinder Block
       CemntBd  Cement Board
       HdBoard  Hard Board
       ImStucc  Imitation Stucco
       MetalSd  Metal Siding
       Other    Other
       Plywood  Plywood
       PreCast  PreCast
       Stone    Stone
       Stucco   Stucco
       VinylSd  Vinyl Siding
       Wd Sdng  Wood Siding
       WdShing  Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material)

       AsbShng  Asbestos Shingles
       AsphShn  Asphalt Shingles
       BrkComm  Brick Common
       BrkFace  Brick Face
       CBlock   Cinder Block
       CemntBd  Cement Board
       HdBoard  Hard Board
       ImStucc  Imitation Stucco
       MetalSd  Metal Siding
       Other    Other
       Plywood  Plywood
       PreCast  PreCast
       Stone    Stone
       Stucco   Stucco
       VinylSd  Vinyl Siding
       Wd Sdng  Wood Siding
       WdShing  Wood Shingles

MasVnrType: Masonry veneer type

       BrkCmn   Brick Common
       BrkFace  Brick Face
       CBlock   Cinder Block
       None     None
       Stone    Stone

ExterQual: Evaluates the quality of the material on the exterior

       Ex       Excellent
       Gd       Good
       TA       Average/Typical
       Fa       Fair
       Po       Poor

ExterCond: Evaluates the present condition of the material on the exterior

       Ex       Excellent
       Gd       Good
       TA       Average/Typical
       Fa       Fair
       Po       Poor

Foundation: Type of foundation

       BrkTil   Brick & Tile
       CBlock   Cinder Block
       PConc    Poured Concrete
       Slab     Slab
       Stone    Stone
       Wood     Wood

BsmtQual: Evaluates the height of the basement

       Ex       Excellent (100+ inches)
       Gd       Good (90-99 inches)
       TA       Typical (80-89 inches)
       Fa       Fair (70-79 inches)
       Po       Poor (<70 inches)
       NB       No Basement

BsmtCond: Evaluates the general condition of the basement

       Ex       Excellent
       Gd       Good
       TA       Typical - slight dampness allowed
       Fa       Fair - dampness or some cracking or settling
       Po       Poor - Severe cracking, settling, or wetness
       NB       No Basement

BsmtExposure: Refers to walkout or garden level walls

       Gd       Good Exposure
       Av       Average Exposure (split levels or foyers typically score average
                or above)
       Mn       Minimum Exposure
       No       No Exposure
       NB       No Basement

BsmtFinType1: Rating of basement finished area

       GLQ      Good Living Quarters
       ALQ      Average Living Quarters
       BLQ      Below Average Living Quarters
       Rec      Average Rec Room
       LwQ      Low Quality
       Unf      Unfinished
       NB       No Basement

BsmtFinType2: Rating of basement finished area (if multiple types)

       GLQ      Good Living Quarters
       ALQ      Average Living Quarters
       BLQ      Below Average Living Quarters
       Rec      Average Rec Room
       LwQ      Low Quality
       Unf      Unfinished
       NB       No Basement

Heating: Type of heating

       Floor    Floor Furnace
       GasA     Gas forced warm air furnace
       GasW     Gas hot water or steam heat
       Grav     Gravity furnace
       OthW     Hot water or steam heat other than gas
       Wall     Wall furnace

HeatingQC: Heating quality and condition

       Ex       Excellent
       Gd       Good
       TA       Average/Typical
       Fa       Fair
       Po       Poor

CentralAir: Central air conditioning

       N        No
       Y        Yes

Electrical: Electrical system

       SBrkr    Standard Circuit Breakers & Romex
       FuseA    Fuse Box over 60 AMP and all Romex wiring (Average)
       FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix      Mixed

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)

KitchenAbvGr: Kitchens above grade

KitchenQual: Kitchen quality

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality (Assume typical unless deductions are warranted)

       Typ      Typical Functionality
       Min1     Minor Deductions 1
       Min2     Minor Deductions 2
       Mod      Moderate Deductions
       Maj1     Major Deductions 1
       Maj2     Major Deductions 2
       Sev      Severely Damaged
       Sal      Salvage only

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

       Ex       Excellent - Exceptional Masonry Fireplace
       Gd       Good - Masonry Fireplace in main level
       TA       Average - Prefabricated Fireplace in main living area
                or Masonry Fireplace in basement
       Fa       Fair - Prefabricated Fireplace in basement
       Po       Poor - Ben Franklin Stove
       No       No Fireplace

GarageType: Garage location

       2Types   More than one type of garage
       Attchd   Attached to home
       Basment  Basement Garage
       BuiltIn  Built-In (Garage part of house - typically has room above garage)
       CarPort  Car Port
       Detchd   Detached from home
       None     No Garage

GarageFinish: Interior finish of the garage

       Fin      Finished
       RFn      Rough Finished
       Unf      Unfinished
       No       No Garage

GarageCars: Size of garage in car capacity

GarageQual: Garage quality

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor
       No       No Garage

GarageCond: Garage condition

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor
       No       No Garage

PavedDrive: Paved driveway

       Y        Paved
       P        Partial Pavement
       N        Dirt/Gravel

PoolQC: Pool quality

       Ex       Excellent
       Gd       Good
       TA       Average/Typical
       Fa       Fair
       No       No Pool

Fence: Fence quality

       GdPrv    Good Privacy
       MnPrv    Minimum Privacy
       GdWo     Good Wood
       MnWw     Minimum Wood/Wire
       No       No Fence

MiscFeature: Miscellaneous feature not covered in other categories

       Elev     Elevator
       Gar2     2nd Garage (if not described in garage section)
       Othr     Other
       Shed     Shed (over 100 SF)
       TenC     Tennis Court
       None     None

MoSold: Month Sold (MM)

YrSold: Year Sold (YYYY)

SaleType: Type of sale

       WD       Warranty Deed - Conventional
       CWD      Warranty Deed - Cash
       VWD      Warranty Deed - VA Loan
       New      Home just constructed and sold
       COD      Court Officer Deed/Estate
       Con      Contract 15% Down payment regular terms
       ConLw    Contract Low Down payment and low interest
       ConLI    Contract Low Interest
       ConLD    Contract Low Down
       Oth      Other

SaleCondition: Condition of sale

       Normal   Normal Sale
       Abnorml  Abnormal Sale -  trade, foreclosure, short sale
       AdjLand  Adjoining Land Purchase
       Alloca   Allocation - two linked properties with separate deeds,
                typically condo with a garage unit
       Family   Sale between family members
       Partial  Home was not completed when last assessed
                (associated with New Homes)

Continuous predictors

# From the consistency checks done by preprocess.R, there are no missing
# values for GarageYrBlt in either the training data or the test data.  The
# preprocessing script set this variable to zero for houses without a garage.
# For convenience in dropping data for plotting, temporarily replace zero by
# NA.
bool_index <- train_data$GarageYrBlt == 0
train_data[bool_index, 'GarageYrBlt'] <- NA

# Function used for transforming YearBuilt and GarageYrBlt
transform_year <- function(x) {
    return((x - 1900)^3 / 1e6)
}
transform_year_formula <- 'year -> (year - 1900)^3 / 1e6'
transform_area_formula <- 'area -> log(area)'

# The information associated with transforming variables is entered as a list
# of lists.  To avoid awkward syntax when accessing elements of the nested
# list, however, the nested list is then used to generate simple 1D lists and
# vectors.  (The reason for using the nested list initially is to group
# together the information for a particular variable.)
transformations <- list(
    GrLivArea = list(func = log,
                     label = 'log(GrLivArea)',
                     formula = transform_area_formula),
    YearBuilt = list(func = transform_year,
                     label = 'transformed YearBuilt',
                     formula = transform_year_formula),
    GarageYrBlt = list(func = transform_year,
                       label = 'transformed GarageYrBlt',
                       formula = transform_year_formula),
    LotArea = list(func = log,
                   label = 'log(LotArea)',
                   formula = transform_area_formula)
)
transformed_variables <- names(transformations)
transformed_variable_labels <- map_chr(transformations, ~ (.)$label)
transformation_functions <- map(transformations, ~ (.)$func)
transformation_formulas <- map_chr(transformations, ~ (.)$formula)

get_variable_label <- function(variable_name) {
    if (variable_name %in% transformed_variables) {
        variable_label <- transformed_variable_labels[variable_name]
    } else {
        variable_label <- variable_name
    }
    return(variable_label)
}

for (variable_name in continuous_predictors) {
    description <- variable_descriptions[[variable_name]]
    to_plot <- select(train_data, all_of(variable_name)) %>%
        drop_na()

    fig <- ggplot(to_plot, aes_string(x = variable_name)) +
        geom_histogram(fill = 'navyblue', bins = 30) +
        ggtitle(variable_name)
    print(fig)

    if (variable_name %in% transformed_variables) {
        # Apply the transformation to the variable.
        func <- transformation_functions[[variable_name]]
        formula <- transformation_formulas[variable_name]
        to_plot[[variable_name]] <- func(
            to_plot[[variable_name]]
        )

        variable_label <- get_variable_label(variable_name)
        fig <- ggplot(to_plot,
                      aes_string(x = variable_name)) +
            geom_histogram(fill = 'navyblue', bins = 30) +
            scale_x_continuous(name = variable_label) +
            ggtitle(variable_label)
        print(fig)

        cat('transformation: ', formula, '\n\n')
    }

    cat(description)
}

LotFrontage: Linear feet of street connected to property

transformation:  area -> log(area) 

LotArea: Lot size in square feet

transformation:  year -> (year - 1900)^3 / 1e6 

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or
              additions)

MasVnrArea: Masonry veneer area in square feet

BsmtFinSF1: Type 1 finished square feet

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

FirstFlrSF: First Floor square feet

SecondFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

transformation:  area -> log(area) 

GrLivArea: Above grade (ground) living area square feet

transformation:  year -> (year - 1900)^3 / 1e6 

GarageYrBlt: Year garage was built

GarageArea: Size of garage in square feet

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

ThreeSsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

MiscVal: $Value of miscellaneous feature

2.3 Which predictors can explain the variance in the sale price?

Summary

Show the fraction of variance in the sale price explained by individual predictors.

Discussion

ANOVA (\(\eta^2\)) is used for nominal predictors, and \(R^2\) is used for numeric predictors.

For many of the predictors that individually explain the largest fraction of variance, the value of \(\eta^2\) or \(R^2\) was increased by the transformations.

  • Example: the highest value of \(R^2\) (OverallQual) increased from 0.63 to 0.67 due to the transformation of SalePrice.
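The \(\eta^2\)-from-lm shortcut used in the code below can be verified on toy data (the variable names here are invented for illustration): for a single categorical predictor, the r.squared reported by lm equals \(\eta^2\) computed from the ANOVA sums of squares.

```r
# Toy check that eta squared (between-group SS over total SS) equals the
# r.squared reported by lm for a one-way categorical model.
set.seed(7)
grp <- factor(sample(c('A', 'B', 'C'), 300, replace = TRUE))
y <- c(0, 1, 2)[as.integer(grp)] + rnorm(300)
eta_sq_lm <- summary(lm(y ~ grp))$r.squared
aov_tab <- summary(aov(y ~ grp))[[1]]
ss <- aov_tab[['Sum Sq']]          # c(between-group SS, residual SS)
eta_sq_aov <- ss[1] / sum(ss)      # identical to eta_sq_lm
```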

Nominal predictors

# Apply transformations to variables in the training data.  The R chunks that
# generate large plots are cached, and so transformations in those chunks
# were done to temporary plotting data frames.
apply_transformations <- function(data) {
    if ('SalePrice' %in% colnames(data)) {
        data <- mutate(data, SalePrice = log(SalePrice))
    }

    for (variable_name in transformed_variables) {
        # Apply the transformation to the variable.
        func <- transformation_functions[[variable_name]]
        data[[variable_name]] <- func(data[[variable_name]])
    }

    return(data)
}

train_data <- apply_transformations(train_data)

get_linear_model <- function(variable_name,
                             dummy_filter = '',
                             outliers = integer()) {
    # For categorical variables, perform ANOVA and return eta squared along
    # with predictions based on the model.  For numeric variables, do a simple
    # linear regression and return R squared along with predictions.
    #
    # Note that the command aov used for ANOVA in R is a wrapper to lm that
    # provides a summary in the traditional ANOVA form.  This summary doesn't
    # include eta squared.  However, the value is available as r.squared from
    # the summary for lm.  So if aov is used to generate the model, then eta
    # squared is available as summary.lm(model)$r.squared.
    #
    # In the current function, use the lm command both for categorical and
    # numeric predictors.
    data <- filter(train_data, !(Id %in% outliers))
    model_formula <- paste0('SalePrice ~ ', variable_name)
    if (nchar(dummy_filter) > 0) {
        model_formula <- paste(model_formula, '+', dummy_filter)
        vars <- c(variable_name, dummy_filter)
        data <- select(data, SalePrice, all_of(vars)) %>%
            drop_na()
    } else {
        data <- select(data, SalePrice, all_of(variable_name)) %>%
            drop_na()
    }
    model_formula <- as.formula(model_formula)
    model <- lm(model_formula, data)
    model$formula <- model_formula
    model$dummy_filter <- dummy_filter
    model$outliers <- outliers
    return(model)
}

anova_models <- map(categorical_predictors, get_linear_model)
names(anova_models) <- categorical_predictors
eta_squared_vec <- map_dbl(anova_models, ~ summary(.)$r.squared)
eta_squared_df <- data.frame(name = categorical_predictors,
                             eta.squared = eta_squared_vec,
                             stringsAsFactors = FALSE) %>%
    arrange(desc(eta.squared)) %>%
    rename(`eta squared` = eta.squared)
get_datatable(eta_squared_df) %>%
    formatRound(columns = 'eta squared', digits = 3)

Numeric predictors

numeric_predictors <- reorder_names(
    c(continuous_predictors, discrete_numeric)
)

linear_models <- map(numeric_predictors, get_linear_model)
names(linear_models) <- numeric_predictors
r_squared_vec <- map_dbl(linear_models, ~ summary(.)$r.squared)
r_squared_df <- data.frame(name = numeric_predictors,
                           r.squared = r_squared_vec,
                           stringsAsFactors = FALSE) %>%
    arrange(desc(r.squared)) %>%
    rename(`R squared` = r.squared)
get_datatable(r_squared_df) %>%
    formatRound(columns = 'R squared', digits = 3)

2.4 Plots of sale price vs individual predictors

Summary

Visualize the dependence of sale price on individual predictors.

Discussion

Predictors with large \(\eta^2\) or \(R^2\) are plotted against sale price.

  • Here “large” is defined to mean \(\gtrsim 0.1\).
  • For both \(\eta^2\) and \(R^2\), there is a gap in the values around 0.1, and predictors falling above the gap are plotted.
  • Predictor distributions and descriptions are also shown.

The linear profile of several of the predictors improved dramatically due to the transformations.

  • For example, the transformation caused the median values shown in many of the box plots to fall roughly on a line.

Box plots

large_eta_squared <- filter(eta_squared_df, `eta squared` >= 0.085)

generate_box_plot <- function(data, variable_name, eta_squared = NULL,
                              show_encoding = FALSE) {

    description <- variable_descriptions[[variable_name]]
    if (show_encoding) {
        encoded_variable_name <- paste0(variable_name, 'Num')
        vars <- c(variable_name, encoded_variable_name)
        to_plot <- select(data, SalePrice, all_of(vars)) %>%
            rename(encoding = .data[[encoded_variable_name]]) %>%
            drop_na()
    } else {
        to_plot <- select(data, SalePrice, all_of(variable_name)) %>%
            drop_na()
    }
    to_plot[[variable_name]] <- as.factor(to_plot[[variable_name]]) %>%
        reorder(to_plot$SalePrice, median)

    if (!is.null(eta_squared)) {
        plot_title <- paste0(variable_name, ', ', '$\\eta^2 = ',
                             eta_squared, '$')
    } else {
        plot_title <- variable_name
    }
    fig <- ggplot(to_plot, aes_string(x = variable_name)) +
        geom_boxplot(aes_string(y = 'SalePrice', fill = variable_name))
    if (show_encoding) {
        fig <- fig +
            geom_point(aes(y = encoding, color = 'encodings')) +
            scale_color_manual(name = variable_name,
                               values = c('encodings' = 'red')) +
            guides(fill = FALSE)
    }
    fig <- fig + scale_y_continuous(name = 'log(SalePrice)') +
        scale_x_discrete(name = variable_name) +
        ggtitle(TeX(plot_title))
    fig <- set_label_angle(fig, to_plot[[variable_name]])
    print(fig)

    # Bar chart of the distribution
    plot_title <- paste(variable_name, 'distribution')
    fig <- ggplot(to_plot, aes_string(x = variable_name)) +
        geom_bar(fill = 'navyblue') +
        scale_x_discrete(name = variable_name) +
        ggtitle(plot_title)
    fig <- set_label_angle(fig, to_plot[[variable_name]])
    print(fig)

    cat(description)
}
for (variable_name in large_eta_squared$name) {
    eta_squared <- round(eta_squared_vec[variable_name], digits = 3)
    generate_box_plot(train_data, variable_name, eta_squared)
}

Neighborhood: Physical locations within Ames city limits

       Blmngtn  Bloomington Heights
       Blueste  Bluestem
       BrDale   Briardale
       BrkSide  Brookside
       ClearCr  Clear Creek
       CollgCr  College Creek
       Crawfor  Crawford
       Edwards  Edwards
       Gilbert  Gilbert
       IDOTRR   Iowa DOT and Rail Road
       MeadowV  Meadow Village
       Mitchel  Mitchell
       Names    North Ames
       NoRidge  Northridge
       NPkVill  Northpark Villa
       NridgHt  Northridge Heights
       NWAmes   Northwest Ames
       OldTown  Old Town
       SWISU    South & West of Iowa State University
       Sawyer   Sawyer
       SawyerW  Sawyer West
       Somerst  Somerset
       StoneBr  Stone Brook
       Timber   Timberland
       Veenker  Veenker

ExterQual: Evaluates the quality of the material on the exterior

       Ex       Excellent
       Gd       Good
       TA       Average/Typical
       Fa       Fair
       Po       Poor

BsmtQual: Evaluates the height of the basement

       Ex       Excellent (100+ inches)
       Gd       Good (90-99 inches)
       TA       Typical (80-89 inches)
       Fa       Fair (70-79 inches)
       Po       Poor (<70 inches)
       NB       No Basement

KitchenQual: Kitchen quality

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor

GarageFinish: Interior finish of the garage

       Fin      Finished
       RFn      Rough Finished
       Unf      Unfinished
       No       No Garage

GarageType: Garage location

       2Types   More than one type of garage
       Attchd   Attached to home
       Basment  Basement Garage
       BuiltIn  Built-In (Garage part of house - typically has room above garage)
       CarPort  Car Port
       Detchd   Detached from home
       None     No Garage

MSSubClass: Identifies the type of dwelling involved in the sale.

        20      1-STORY 1946 & NEWER ALL STYLES
        30      1-STORY 1945 & OLDER
        40      1-STORY W/FINISHED ATTIC ALL AGES
        45      1-1/2 STORY - UNFINISHED ALL AGES
        50      1-1/2 STORY FINISHED ALL AGES
        60      2-STORY 1946 & NEWER
        70      2-STORY 1945 & OLDER
        75      2-1/2 STORY ALL AGES
        80      SPLIT OR MULTI-LEVEL
        85      SPLIT FOYER
        90      DUPLEX - ALL STYLES AND AGES
       120      1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150      1-1/2 STORY PUD - ALL AGES
       160      2-STORY PUD - 1946 & NEWER
       180      PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190      2 FAMILY CONVERSION - ALL STYLES AND AGES

FireplaceQu: Fireplace quality

       Ex       Excellent - Exceptional Masonry Fireplace
       Gd       Good - Masonry Fireplace in main level
       TA       Average - Prefabricated Fireplace in main living area
                or Masonry Fireplace in basement
       Fa       Fair - Prefabricated Fireplace in basement
       Po       Poor - Ben Franklin Stove
       No       No Fireplace

Foundation: Type of foundation

       BrkTil   Brick & Tile
       CBlock   Cinder Block
       PConc    Poured Concrete
       Slab     Slab
       Stone    Stone
       Wood     Wood

HeatingQC: Heating quality and condition

       Ex       Excellent
       Gd       Good
       TA       Average/Typical
       Fa       Fair
       Po       Poor

BsmtFinType1: Rating of basement finished area

       GLQ      Good Living Quarters
       ALQ      Average Living Quarters
       BLQ      Below Average Living Quarters
       Rec      Average Rec Room
       LwQ      Low Quality
       Unf      Unfinished
       NB       No Basement

MasVnrType: Masonry veneer type

       BrkCmn   Brick Common
       BrkFace  Brick Face
       CBlock   Cinder Block
       None     None
       Stone    Stone

Exterior1st: Exterior covering on house

       AsbShng  Asbestos Shingles
       AsphShn  Asphalt Shingles
       BrkComm  Brick Common
       BrkFace  Brick Face
       CBlock   Cinder Block
       CemntBd  Cement Board
       HdBoard  Hard Board
       ImStucc  Imitation Stucco
       MetalSd  Metal Siding
       Other    Other
       Plywood  Plywood
       PreCast  PreCast
       Stone    Stone
       Stucco   Stucco
       VinylSd  Vinyl Siding
       Wd Sdng  Wood Siding
       WdShing  Wood Shingles

MSZoning: Identifies the general zoning classification of the sale.

       A        Agriculture
       C        Commercial
       FV       Floating Village Residential
       I        Industrial
       RH       Residential High Density
       RL       Residential Low Density
       RP       Residential Low Density Park
       RM       Residential Medium Density

Exterior2nd: Exterior covering on house (if more than one material)

       AsbShng  Asbestos Shingles
       AsphShn  Asphalt Shingles
       BrkComm  Brick Common
       BrkFace  Brick Face
       CBlock   Cinder Block
       CemntBd  Cement Board
       HdBoard  Hard Board
       ImStucc  Imitation Stucco
       MetalSd  Metal Siding
       Other    Other
       Plywood  Plywood
       PreCast  PreCast
       Stone    Stone
       Stucco   Stucco
       VinylSd  Vinyl Siding
       Wd Sdng  Wood Siding
       WdShing  Wood Shingles

GarageCond: Garage condition

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor
       No       No Garage

BsmtExposure: Refers to walkout or garden level walls

       Gd       Good Exposure
       Av       Average Exposure (split levels or foyers typically score average
                or above)
       Mn       Minimum Exposure
       No       No Exposure
       NB       No Basement

GarageQual: Garage quality

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor
       No       No Garage

SaleCondition: Condition of sale

       Normal   Normal Sale
       Abnorml  Abnormal Sale -  trade, foreclosure, short sale
       AdjLand  Adjoining Land Purchase
       Alloca   Allocation - two linked properties with separate deeds,
                typically condo with a garage unit
       Family   Sale between family members
       Partial  Home was not completed when last assessed
                (associated with New Homes)

CentralAir: Central air conditioning

       N        No
       Y        Yes

SaleType: Type of sale

       WD       Warranty Deed - Conventional
       CWD      Warranty Deed - Cash
       VWD      Warranty Deed - VA Loan
       New      Home just constructed and sold
       COD      Court Officer Deed/Estate
       Con      Contract 15% Down payment regular terms
       ConLw    Contract Low Down payment and low interest
       ConLI    Contract Low Interest
       ConLD    Contract Low Down
       Oth      Other

HouseStyle: Style of dwelling

       1Story   One story
       1.5Fin   One and one-half story: 2nd level finished
       1.5Unf   One and one-half story: 2nd level unfinished
       2Story   Two story
       2.5Fin   Two and one-half story: 2nd level finished
       2.5Unf   Two and one-half story: 2nd level unfinished
       SFoyer   Split Foyer
       SLvl     Split Level

Electrical: Electrical system

       SBrkr    Standard Circuit Breakers & Romex
       FuseA    Fuse Box over 60 AMP and all Romex wiring (Average)
       FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix      Mixed

PavedDrive: Paved driveway

       Y        Paved
       P        Partial Pavement
       N        Dirt/Gravel

LotShape: General shape of property

       Reg      Regular
       IR1      Slightly irregular
       IR2      Moderately Irregular
       IR3      Irregular

BsmtCond: Evaluates the general condition of the basement

       Ex       Excellent
       Gd       Good
       TA       Typical - slight dampness allowed
       Fa       Fair - dampness or some cracking or settling
       Po       Poor - Severe cracking, settling, or wetness
       NB       No Basement

Scatter plots

large_r_squared <- filter(r_squared_df, `R squared` >= 0.095)

# Reusable plotting function
plot_simple_fit <- function(variable_name, linear_model, r_squared = 0) {
    outliers <- linear_model$outliers
    dummy_filter <- linear_model$dummy_filter
    description <- variable_descriptions[[variable_name]]
    to_plot <- filter(train_data, !(Id %in% outliers))

    if (nchar(dummy_filter) > 0) {
        to_plot <- filter(to_plot, .data[[dummy_filter]] == 0) %>%
            select(SalePrice, all_of(c(variable_name, dummy_filter))) %>%
            drop_na()
    } else {
        to_plot <- select(to_plot, SalePrice, all_of(variable_name)) %>%
            drop_na()
    }

    prediction <- predict(linear_model, to_plot)

    to_plot <- mutate(to_plot, PredictedPrice = prediction)

    # Scatter plot
    variable_label <- get_variable_label(variable_name)
    if (r_squared > 0) {
        plot_title <- paste0(
            variable_label, ', ', '$R^2 = ', r_squared, '$'
        )
    } else {
        plot_title <- variable_label
    }
    fig <- ggplot(to_plot, aes_string(x = variable_name)) +
        geom_point(aes(y = SalePrice, color = 'data')) +
        geom_line(aes(y = PredictedPrice, color = 'fit')) +
        scale_color_manual(
            name = NULL,
            values = c(
                'data' = 'navyblue',
                'fit' = 'red'
            )) +
        scale_y_continuous(name = 'log(SalePrice)') +
        ggtitle(TeX(plot_title))

    is_discrete <- variable_name %in% discrete_numeric
    if (is_discrete) {
        breaks <- get_discrete_values(to_plot[[variable_name]])
        labels <- as.character(breaks)
        fig <- fig + scale_x_continuous(
            name = variable_label, breaks = breaks, labels = labels
        )
    } else {
        fig <- fig + scale_x_continuous(name = variable_name)
    }
    print(fig)

    # Histogram or bar chart of the distribution
    plot_title <- paste(variable_label, 'distribution')
    if (is_discrete) {
        to_plot[[variable_name]] <- as.factor(to_plot[[variable_name]])
        fig <- ggplot(to_plot, aes_string(x = variable_name)) +
            geom_bar(fill = 'navyblue') +
            scale_x_discrete(limits = labels)
    } else {
        fig <- ggplot(to_plot, aes_string(x = variable_name)) +
            geom_histogram(fill = 'navyblue', bins = 30) +
            scale_x_continuous(name = variable_label)
    }
    print(fig + ggtitle(plot_title))

    if (variable_name %in% transformed_variables) {
        cat('transformation: ',
            transformation_formulas[variable_name],
            '\n\n'
        )
    }
    cat(description)
}

for (variable_name in large_r_squared$name) {
    r_squared <- round(r_squared_vec[variable_name], digits = 3)
    linear_model <- linear_models[[variable_name]]
    plot_simple_fit(variable_name, linear_model, r_squared)
}

OverallQual: Rates the overall material and finish of the house

       10       Very Excellent
       9        Excellent
       8        Very Good
       7        Good
       6        Above Average
       5        Average
       4        Below Average
       3        Fair
       2        Poor
       1        Very Poor

transformation:  area -> log(area) 

GrLivArea: Above grade (ground) living area square feet

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

transformation:  year -> (year - 1900)^3 / 1e6 

YearBuilt: Original construction date

TotalBsmtSF: Total square feet of basement area

FirstFlrSF: First Floor square feet

FullBath: Full bathrooms above grade

transformation:  year -> (year - 1900)^3 / 1e6 

GarageYrBlt: Year garage was built

YearRemodAdd: Remodel date (same as construction date if no remodeling or
              additions)

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Fireplaces: Number of fireplaces

MasVnrArea: Masonry veneer area in square feet

transformation:  area -> log(area) 

LotArea: Lot size in square feet

BsmtFinSF1: Type 1 finished square feet

LotFrontage: Linear feet of street connected to property

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

SecondFlrSF: Second floor square feet

HalfBath: Half baths above grade

2.5 Feature engineering

The goal is to define simple features that have a strong linear association with the target.

2.5.1 Define dummy variables, exclude outliers

Summary

Enhance the linear dependence of sale price on individual numeric predictors.

Discussion

The starting point for the process is the set of predictors that have large \(R^2\).

For certain scatter plots, the fit appears distorted by outliers and/or by the points at the vertical intercept.

Distortions due to the vertical intercept can be eliminated by defining a dummy variable.

  • Example: A dummy variable NoBasement that identifies houses without a basement was defined in order to avoid distortions in the fit for variables that characterize basement area.

With outliers excluded and appropriate dummy variables defined, \(R^2\) increased and the fit shown in scatter plots improved.
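The effect of such a dummy variable can be sketched on synthetic data (toy example, not project results). A spike of zero-area houses at the vertical intercept distorts a plain linear fit, and the dummy absorbs it; because the two models are nested, \(R^2\) cannot decrease when the dummy is added:

```r
# Synthetic example: houses with no basement sit at area = 0 with an
# offset in log price, mimicking the vertical-intercept distortion.
set.seed(2)
n <- 200
area <- c(rep(0, 40), runif(n - 40, min = 400, max = 2000))
no_basement <- as.integer(area == 0)
log_price <- 11 + 4e-4 * area - 0.5 * no_basement + rnorm(n, sd = 0.1)
toy <- data.frame(log_price, area, no_basement)

r2_plain <- summary(lm(log_price ~ area, toy))$r.squared
r2_dummy <- summary(lm(log_price ~ area + no_basement, toy))$r.squared
stopifnot(r2_dummy >= r2_plain)
```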

Changes in \(R^2\)

# Add dummy variables to data.  From the consistency checks done by the
# preprocessing script, the variables used in defining NoBasement,
# NoGarage, NoFireplace had no missing values in the original data sets
# (both training and test).  However, the current script set GarageYrBlt to
# NA for houses with no garage to facilitate plotting, and so is.na is used
# to define the variable NoGarage.
define_dummies <- function(data) {
    data <- mutate(data,
                   NoBasement = as.integer(TotalBsmtSF == 0),
                   NoGarage = as.integer(is.na(GarageYrBlt)),
                   NoFireplace = as.integer(Fireplaces == 0),
                   NoSecondFloor = as.integer(SecondFlrSF == 0))
    return(data)
}
train_data <- define_dummies(train_data)

# Now that NoGarage can be used to distinguish houses without a garage,
# GarageYrBlt can be set to an arbitrary value for houses without a garage.
# Choose -0.1 as an arbitrary value to separate these houses in the plot of
# GarageYrBlt.
bool_index <- is.na(train_data$GarageYrBlt)
train_data[bool_index, 'GarageYrBlt'] <- -0.1

basement_related <- c(
    'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
    'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath',
    'BsmtHalfBath'
)
garage_related <- c(
    'GarageFinish', 'GarageQual', 'GarageCond', 'GarageType', 'GarageYrBlt',
    'GarageCars', 'GarageArea'
)
fireplace_related <- c('FireplaceQu', 'Fireplaces')
second_floor_related <- 'SecondFlrSF'

# Find the dummy variable that imposes a constraint on the values for a
# variable.
get_dummy_filter <- function(variable_name) {
    dummy <- case_when(
        variable_name %in% basement_related ~ 'NoBasement',
        variable_name %in% garage_related ~ 'NoGarage',
        variable_name %in% fireplace_related ~ 'NoFireplace',
        variable_name %in% second_floor_related ~ 'NoSecondFloor',
        TRUE ~ ''
    )
    return(dummy)
}

bool_index <- !(all_column_names %in% c('Id', 'SalePrice'))
original_predictors <- all_column_names[bool_index]
dummy_filters <- get_dummy_filter(original_predictors)
names(dummy_filters) <- original_predictors

# For certain variables, the fit is visibly off because of a few outliers for
# which the variable has a very large value.  Define a vector of formula
# strings to use for dropping outliers in evaluating the fit of individual
# variables.
outlier_formulas <- c(
    GrLivArea = '>= log(4500)',
    GarageArea = '>= 1240',
    TotalBsmtSF = '>= 3000',
    BsmtFinSF1 = '>= 3000',
    SecondFlrSF = '> 3000',
    LotFrontage = '> 300'
)

# Convert these formulas into a list of outliers.
get_outliers <- function(data, variable_name) {
    if (variable_name %in% names(outlier_formulas)) {
        formula <- parse_expr(paste(
            variable_name, outlier_formulas[[variable_name]]
        ))
        filtered_data <- filter(data, !!formula)
        outliers <- filtered_data$Id
    } else {
        outliers <- integer()
    }
    return(outliers)
}

variable_outliers <- map(original_predictors,
                         ~ get_outliers(train_data, .))
names(variable_outliers) <- original_predictors
# In order to check how linear fits are improved by the exclusion of outliers
# and/or the addition of dummy variables, generate new linear models.  For
# example, include NoBasement as a second predictor when finding the
# linear fit for the predictor TotalBsmtSF, and exclude
# variable_outliers[['TotalBsmtSF']] from the fit.
#
# As new fits are generated, append values to vectors that will form a data
# frame showing changes in R squared.
original_model <- character()
updated_model <- character()
original_r_squared <- numeric()
updated_r_squared <- numeric()
for (variable_name in large_r_squared$name) {
    outliers <- variable_outliers[[variable_name]]
    dummy_filter <- dummy_filters[variable_name]
    model_update_needed <- length(outliers) > 0 || nchar(dummy_filter) > 0
    if (model_update_needed) {
        # The elements of dummy_filters are named.  For instance, in the case
        # where variable_name is 'TotalBsmtSF', the variable
        #
        # dummy_filter <- dummy_filters[variable_name]
        #
        # is the following:
        #
        # > dummy_filter
        #  TotalBsmtSF
        #  "NoBasement"
        #
        # If dummy_filter is used in a dplyr select statement, the column
        # 'NoBasement' gets selected and then renamed to TotalBsmtSF.  In
        # order to avoid this, set the name of dummy_filter to NULL.
        names(dummy_filter) <- NULL
        linear_model <- get_linear_model(
            variable_name, dummy_filter, outliers = outliers
        )
        linear_models[[variable_name]] <- linear_model

        original_model <- c(original_model, variable_name)
        if (nchar(dummy_filter) > 0) {
            updated_model <- c(updated_model,
                               paste(variable_name, '+', dummy_filter))
        } else {
            updated_model <- c(updated_model, variable_name)
        }
        original_r_squared <- c(original_r_squared,
                                r_squared_vec[variable_name])
        updated_r_squared <- c(updated_r_squared,
                               summary(linear_model)$r.squared)
    }
}

r_squared_changes <- data.frame(`original model` = original_model,
                                `original R squared` = original_r_squared,
                                `updated model` = updated_model,
                                `updated R squared` = updated_r_squared,
                                check.names = FALSE,
                                stringsAsFactors = FALSE)
get_datatable(r_squared_changes) %>%
    formatRound(columns = c('original R squared', 'updated R squared'),
                digits = 3)

Scatter plots

for (variable_name in r_squared_changes[['original model']]) {
    bool_index <- r_squared_changes[['original model']] == variable_name
    r_squared <- round(r_squared_changes[bool_index, 'updated R squared'],
                       digits = 3)
    linear_model <- linear_models[[variable_name]]
    plot_simple_fit(variable_name, linear_model, r_squared)
}

transformation:  area -> log(area) 

GrLivArea: Above grade (ground) living area square feet

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

TotalBsmtSF: Total square feet of basement area

transformation:  year -> (year - 1900)^3 / 1e6 

GarageYrBlt: Year garage was built

Fireplaces: Number of fireplaces

BsmtFinSF1: Type 1 finished square feet

LotFrontage: Linear feet of street connected to property

SecondFlrSF: Second floor square feet

2.5.2 Encode nominal variables with numbers

Summary

Apply \(k\)-fold target encoding to each nominal variable. For example, encoding of the Neighborhood variable yielded a NeighborhoodNum variable.

Discussion

Target encoding encodes each category by the mean of the target values for that category.

This method is prone to overfitting, but variants have been developed to minimize overfitting:

  • \(K\)-fold target encoding randomly separates the training data into \(k\) folds. Samples in a given fold are encoded using data not in that fold.
  • For categories with a small number of samples, the encoding is a mix of the category target mean and the global target mean.

Five folds were used here, and the weighting of the global target mean was given by a sigmoid function that decreased from \(\approx 1\) to \(\approx 0\) as the number of samples in the category increased from 0 to 10.

The script encode.R created for this project supports two methods of encoding test data:

  • Given an observation and the class for that observation, randomly select one of the \(k\) encodings generated for the class.

  • All observations of a given class receive the mean of the \(k\) encodings generated for the class. This is the natural choice for a linear model, since it is equivalent to averaging the model output over many possible random encodings of the predictors.

Missing values were encoded using the global target mean.
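A minimal sketch of the smoothed \(k\)-fold scheme described above is given below. The project's encode.R may differ in naming and implementation details; sigmoid_weight and kfold_target_encode here are illustrative names, not the project's functions:

```r
library(dplyr)

# Weight given to the global target mean: ~1 for very small categories,
# decreasing toward ~0 as the category count approaches 10.
sigmoid_weight <- function(n, midpoint = 5, scale = 1) {
    1 / (1 + exp((n - midpoint) / scale))
}

kfold_target_encode <- function(data, var, target, k = 5) {
    fold <- sample(rep_len(seq_len(k), nrow(data)))
    global_mean <- mean(data[[target]])
    encoded <- numeric(nrow(data))
    for (i in seq_len(k)) {
        in_fold <- fold == i
        # Category means and counts are computed out-of-fold.
        stats <- data[!in_fold, ] %>%
            group_by(.data[[var]]) %>%
            summarise(m = mean(.data[[target]]), n = n(), .groups = 'drop')
        w <- sigmoid_weight(stats$n)
        enc <- setNames(w * global_mean + (1 - w) * stats$m, stats[[var]])
        vals <- enc[data[[var]][in_fold]]
        # Unseen or missing categories fall back to the global mean.
        vals[is.na(vals)] <- global_mean
        encoded[in_fold] <- vals
    }
    encoded
}
```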

Visualize encodings

# Encode all categorical predictors.

# Convert MSSubClass to a character vector to simplify the encoding.  Note
# that 'MSSubClass' is a factor vector, while the rest of the categorical
# predictors are character vectors.  The reason that MSSubClass is handled
# differently is that the variable is a scattered set of integers that
# represent classes.  If this variable is converted to a character early in
# the current script, the ordering of factors for plots is lexical, with
# '190' coming before '20', which makes the plotted distribution of MSSubClass
# harder to understand.  Converting it from an integer vector (imported by
# read.csv) directly to a factor automatically gives factors ordered by
# integer value in the distribution plot.
train_data <- mutate(train_data, MSSubClass = as.character(MSSubClass))

target_mean <- mean(train_data$SalePrice)

encode_predictors <- function(data, method = 'mean', drop_categorical = TRUE) {
    # The functions encode and assign_encoding are defined in the script
    # encode.R.
    if ('SalePrice' %in% colnames(data)) {
        # Encode training data.
        for (variable_name in categorical_predictors) {
            data <- encode(data, variable_name, target_mean)
        }
    } else {
        # Assign encodings to test data.
        for (variable_name in categorical_predictors) {
            data <- assign_encoding(data, variable_name,
                                    target_mean = target_mean, method = method)
        }
    }

    if (drop_categorical) {
        data <- select(data, -all_of(categorical_predictors))
    }

    return(data)
}

train_data <- encode_predictors(train_data, drop_categorical = FALSE)
for (variable_name in large_eta_squared$name) {
    eta_squared <- round(eta_squared_vec[variable_name], digits = 3)
    generate_box_plot(train_data, variable_name, eta_squared,
                      show_encoding = TRUE)
}

Neighborhood: Physical locations within Ames city limits

       Blmngtn  Bloomington Heights
       Blueste  Bluestem
       BrDale   Briardale
       BrkSide  Brookside
       ClearCr  Clear Creek
       CollgCr  College Creek
       Crawfor  Crawford
       Edwards  Edwards
       Gilbert  Gilbert
       IDOTRR   Iowa DOT and Rail Road
       MeadowV  Meadow Village
       Mitchel  Mitchell
       Names    North Ames
       NoRidge  Northridge
       NPkVill  Northpark Villa
       NridgHt  Northridge Heights
       NWAmes   Northwest Ames
       OldTown  Old Town
       SWISU    South & West of Iowa State University
       Sawyer   Sawyer
       SawyerW  Sawyer West
       Somerst  Somerset
       StoneBr  Stone Brook
       Timber   Timberland
       Veenker  Veenker

ExterQual: Evaluates the quality of the material on the exterior

       Ex       Excellent
       Gd       Good
       TA       Average/Typical
       Fa       Fair
       Po       Poor

BsmtQual: Evaluates the height of the basement

       Ex       Excellent (100+ inches)
       Gd       Good (90-99 inches)
       TA       Typical (80-89 inches)
       Fa       Fair (70-79 inches)
       Po       Poor (<70 inches)
       NB       No Basement

KitchenQual: Kitchen quality

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor

GarageFinish: Interior finish of the garage

       Fin      Finished
       RFn      Rough Finished
       Unf      Unfinished
       No       No Garage

GarageType: Garage location

       2Types   More than one type of garage
       Attchd   Attached to home
       Basment  Basement Garage
       BuiltIn  Built-In (Garage part of house - typically has room above garage)
       CarPort  Car Port
       Detchd   Detached from home
       None     No Garage

MSSubClass: Identifies the type of dwelling involved in the sale.

        20      1-STORY 1946 & NEWER ALL STYLES
        30      1-STORY 1945 & OLDER
        40      1-STORY W/FINISHED ATTIC ALL AGES
        45      1-1/2 STORY - UNFINISHED ALL AGES
        50      1-1/2 STORY FINISHED ALL AGES
        60      2-STORY 1946 & NEWER
        70      2-STORY 1945 & OLDER
        75      2-1/2 STORY ALL AGES
        80      SPLIT OR MULTI-LEVEL
        85      SPLIT FOYER
        90      DUPLEX - ALL STYLES AND AGES
       120      1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150      1-1/2 STORY PUD - ALL AGES
       160      2-STORY PUD - 1946 & NEWER
       180      PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190      2 FAMILY CONVERSION - ALL STYLES AND AGES

FireplaceQu: Fireplace quality

       Ex       Excellent - Exceptional Masonry Fireplace
       Gd       Good - Masonry Fireplace in main level
       TA       Average - Prefabricated Fireplace in main living area
                or Masonry Fireplace in basement
       Fa       Fair - Prefabricated Fireplace in basement
       Po       Poor - Ben Franklin Stove
       No       No Fireplace

Foundation: Type of foundation

       BrkTil   Brick & Tile
       CBlock   Cinder Block
       PConc    Poured Concrete
       Slab     Slab
       Stone    Stone
       Wood     Wood

HeatingQC: Heating quality and condition

       Ex       Excellent
       Gd       Good
       TA       Average/Typical
       Fa       Fair
       Po       Poor

BsmtFinType1: Rating of basement finished area

       GLQ      Good Living Quarters
       ALQ      Average Living Quarters
       BLQ      Below Average Living Quarters
       Rec      Average Rec Room
       LwQ      Low Quality
       Unf      Unfinished
       NB       No Basement

MasVnrType: Masonry veneer type

       BrkCmn   Brick Common
       BrkFace  Brick Face
       CBlock   Cinder Block
       None     None
       Stone    Stone

Exterior1st: Exterior covering on house

       AsbShng  Asbestos Shingles
       AsphShn  Asphalt Shingles
       BrkComm  Brick Common
       BrkFace  Brick Face
       CBlock   Cinder Block
       CemntBd  Cement Board
       HdBoard  Hard Board
       ImStucc  Imitation Stucco
       MetalSd  Metal Siding
       Other    Other
       Plywood  Plywood
       PreCast  PreCast
       Stone    Stone
       Stucco   Stucco
       VinylSd  Vinyl Siding
       Wd Sdng  Wood Siding
       WdShing  Wood Shingles

MSZoning: Identifies the general zoning classification of the sale.

       A        Agriculture
       C        Commercial
       FV       Floating Village Residential
       I        Industrial
       RH       Residential High Density
       RL       Residential Low Density
       RP       Residential Low Density Park
       RM       Residential Medium Density

Exterior2nd: Exterior covering on house (if more than one material)

       AsbShng  Asbestos Shingles
       AsphShn  Asphalt Shingles
       BrkComm  Brick Common
       BrkFace  Brick Face
       CBlock   Cinder Block
       CemntBd  Cement Board
       HdBoard  Hard Board
       ImStucc  Imitation Stucco
       MetalSd  Metal Siding
       Other    Other
       Plywood  Plywood
       PreCast  PreCast
       Stone    Stone
       Stucco   Stucco
       VinylSd  Vinyl Siding
       Wd Sdng  Wood Siding
       WdShing  Wood Shingles

GarageCond: Garage condition

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor
       No       No Garage

BsmtExposure: Refers to walkout or garden level walls

       Gd       Good Exposure
       Av       Average Exposure (split levels or foyers typically score average
                or above)
       Mn       Minimum Exposure
       No       No Exposure
       NB       No Basement

GarageQual: Garage quality

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor
       No       No Garage

SaleCondition: Condition of sale

       Normal   Normal Sale
       Abnorml  Abnormal Sale -  trade, foreclosure, short sale
       AdjLand  Adjoining Land Purchase
       Alloca   Allocation - two linked properties with separate deeds,
                typically condo with a garage unit
       Family   Sale between family members
       Partial  Home was not completed when last assessed
                (associated with New Homes)

CentralAir: Central air conditioning

       N        No
       Y        Yes

SaleType: Type of sale

       WD       Warranty Deed - Conventional
       CWD      Warranty Deed - Cash
       VWD      Warranty Deed - VA Loan
       New      Home just constructed and sold
       COD      Court Officer Deed/Estate
       Con      Contract 15% Down payment regular terms
       ConLw    Contract Low Down payment and low interest
       ConLI    Contract Low Interest
       ConLD    Contract Low Down
       Oth      Other

HouseStyle: Style of dwelling

       1Story   One story
       1.5Fin   One and one-half story: 2nd level finished
       1.5Unf   One and one-half story: 2nd level unfinished
       2Story   Two story
       2.5Fin   Two and one-half story: 2nd level finished
       2.5Unf   Two and one-half story: 2nd level unfinished
       SFoyer   Split Foyer
       SLvl     Split Level

Electrical: Electrical system

       SBrkr    Standard Circuit Breakers & Romex
       FuseA    Fuse Box over 60 AMP and all Romex wiring (Average)
       FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix      Mixed

PavedDrive: Paved driveway

       Y        Paved
       P        Partial Pavement
       N        Dirt/Gravel

LotShape: General shape of property

       Reg      Regular
       IR1      Slightly irregular
       IR2      Moderately Irregular
       IR3      Irregular

BsmtCond: Evaluates the general condition of the basement

       Ex       Excellent
       Gd       Good
       TA       Typical - slight dampness allowed
       Fa       Fair - dampness or some cracking or settling
       Po       Poor - Severe cracking, settling, or wetness
       NB       No Basement

2.5.3 Outcome of feature engineering

Summary

List the changes applied to variables in the original data set.

Details

Changes to the variables:

  1. The target (SalePrice) and two predictors (LotArea, GrLivArea) were transformed using the logarithm.
  2. The predictors YearBuilt and GarageYrBlt were transformed using the function \(f(x) = (x - 1900)^3 / 10^6\).
  3. Dummy variables NoBasement, NoGarage, NoFireplace, NoSecondFloor were defined in order to eliminate distortions in the fit for related numeric variables.
  4. Outliers for individual numeric predictors were defined.
  5. Each nominal variable in the training set was encoded using \(k\)-fold target encoding with \(k=5\). For categories with \(<10\) samples, a sigmoid function mixed the global target mean with the category target mean to give the encoding. As a natural extension of this step, missing categorical data was encoded with the global target mean.
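The smoothing in step 5 can be sketched in a few lines of base R. This is an illustrative sketch rather than the project's implementation: the \(k\)-fold (out-of-fold) bookkeeping is omitted, and the sample threshold `n0 = 10` and sigmoid scale `s = 2` are assumed values.

```r
# Sketch of sigmoid-smoothed target encoding for one nominal column.
# Categories with many samples keep their own target mean; rare categories
# are shrunk toward the global target mean.  (The k-fold splitting used in
# the actual pipeline is omitted here for brevity.)
encode_category <- function(categories, target, n0 = 10, s = 2) {
    global_mean <- mean(target)
    category_means <- tapply(target, categories, mean)
    category_counts <- tapply(target, categories, length)
    # Sigmoid weight: near 1 for well-sampled categories, near 0 for rare ones.
    weight <- 1 / (1 + exp(-(category_counts - n0) / s))
    encoding <- weight * category_means + (1 - weight) * global_mean
    encoded <- encoding[as.character(categories)]
    # Unseen or missing categories fall back to the global target mean.
    encoded[is.na(encoded)] <- global_mean
    unname(encoded)
}
```

A level absent from the lookup table (or a missing value) indexes to NA and is replaced by the global mean, matching the treatment of missing categorical data described in step 5.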

2.5.4 Additional possibilities for feature engineering

Summary

Discuss additional feature engineering that could lead to an improved multilinear model.

Discussion

New variables could be defined by eliminating poorly sampled discrete values.

  • Example: There are only a few houses in the training data with GarageCars equal to 4, and these houses all have a price lower than that predicted by the linear fit. The model would likely be improved by merging the values 3 and 4 in GarageCars, i.e., using the value 3 to represent all houses with 3 or more cars.
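The proposed merge amounts to a one-line cap in base R; the helper name below is hypothetical:

```r
# Pool the sparsely sampled value 4 of GarageCars into the value 3.
merge_garage_cars <- function(garage_cars, cap = 3) {
    pmin(garage_cars, cap)
}
```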

Methods of encoding categorical variables could be more fully explored using cross validation for a particular model.

  • Example: For k-fold target encoding, cross-validation could be used to tune the number of folds as well as the weight given to the global target mean for categories with a small number of samples.

2.6 Are the predictors independent?

Summary

Display a table of correlations between predictors as an aid to tuning the multilinear model.

Discussion

A correlation matrix was calculated that included the following 45 predictors:

  • Numeric predictors with large \(R^2\).
  • Encoded versions of categorical predictors with large \(\eta^2\).
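As a reminder, \(\eta^2\) for a nominal predictor is the between-group sum of squares of the target divided by the total sum of squares. The helper below is a base-R sketch for reference; the report's own \(\eta^2\) computation is defined in an earlier section and may differ in detail.

```r
# Eta squared: fraction of the target's variance accounted for by group
# membership in a nominal predictor (SS_between / SS_total).
get_eta_squared <- function(categories, target) {
    complete <- !is.na(categories) & !is.na(target)
    categories <- categories[complete]
    target <- target[complete]
    grand_mean <- mean(target)
    group_means <- tapply(target, categories, mean)
    group_counts <- tapply(target, categories, length)
    ss_between <- sum(group_counts * (group_means - grand_mean)^2)
    ss_total <- sum((target - grand_mean)^2)
    ss_between / ss_total
}
```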

In the table, correlations are presented in order of decreasing magnitude.

Many of the predictors that have a strong linear association with the target are also strongly correlated with other predictors.

Correlations

linear_predictors <- c(large_r_squared$name,
                       paste0(large_eta_squared$name, 'Num'))

corr_matrix <- select(train_data, all_of(linear_predictors)) %>%
    cor(use = 'pairwise.complete.obs')
# Set to NA the upper triangular part of the correlation matrix.
bool_index <- lower.tri(corr_matrix, diag = TRUE)
corr_matrix[bool_index] <- NA

correlations <- as.data.frame(corr_matrix) %>%
    mutate(var_1 = rownames(.)) %>%
    pivot_longer(-var_1, names_to = 'var_2', values_to = 'corr') %>%
    filter(!is.na(corr)) %>%
    arrange(desc(abs(corr))) %>%
    rename(`predictor 1` = var_1, `predictor 2` = var_2,
           correlation = corr)
get_datatable(correlations) %>%
    formatRound(columns = 'correlation', digits = 3)

3 Multilinear model

A descriptive model that aims to give insight into the housing market.

3.1 Stepwise regression

Summary

Use stepwise regression to select variables for the model.

Discussion

An iterative process was used to tune the stepwise regression:

  1. Exclude outliers after inspecting previous results with plot.lm. Three outliers were dropped in the final iteration.

  2. Drop variables that contain many missing values. In the first iteration, only LotFrontage was dropped, but tests found that MasVnrArea was not an important variable at any step of the regression, and so it was dropped as well. The remaining numeric and encoded categorical predictors had no missing values.

  3. Execute the stepwise regression using a data set that includes all numeric and encoded categorical predictors, together with dummy variables such as NoBasement. Use the Bayesian information criterion in order to favor model simplicity.

  4. Inspect the model selected by stepwise regression, particularly checking the variance inflation factors and the effects of outliers.

\(K\)-fold target encoding introduces a random element into the training process, and as a result, the set of variables selected by stepwise regressions varied with successive iterations of the training process.

  • Stepwise regression consistently yielded around 27 variables, but two or three of the selected variables would typically change between iterations.

After the process was tuned, a handful of repetitions of the stepwise selection were performed, with a log generated for each repetition. One of the repetitions gave a somewhat lower variance inflation factor than the others, and the variables selected in this repetition were used as the starting point for lasso regression.

Outcome

outliers <- c(1299, 524, 633)
dummy_variables <- c('NoBasement', 'NoGarage', 'NoFireplace',
                     'NoSecondFloor')
encoded_categorical <- paste0(categorical_predictors, 'Num')
lm_predictors <- c(continuous_predictors, discrete_numeric,
                   encoded_categorical, dummy_variables)
# Drop LotFrontage and MasVnrArea because of missing values.
bool_index <- !(lm_predictors %in% c('LotFrontage', 'MasVnrArea'))
lm_predictors <- lm_predictors[bool_index]

# Boolean that determines whether stepwise regression is run when the
# markdown file is rendered.
select_lm <- FALSE
if (select_lm) {
    # Note that if this code is executed as a single block, some of the output
    # that should be captured by the sink() command does not get captured.  In
    # practice, this code was executed interactively during an iterative process
    # of model development.  When commands are executed interactively while
    # waiting for one command to complete before executing the next, all output
    # is captured by the sink() command.

    # Because of the select command below, lm_train_data has no missing
    # values.
    lm_train_data <- train_data %>%
        select(all_of(lm_predictors), SalePrice, Id) %>%
        filter(!(Id %in% outliers))
    trivial_model <- lm(SalePrice ~ 1, data = lm_train_data)
    scope <- list(
        lower = formula(trivial_model),
        upper = as.formula(
            paste('SalePrice ~',
                  paste0(lm_predictors, collapse = ' + '))
        )
    )

    sink('stepwise_search_BIC.log')
    best_lm <- step(trivial_model, scope, direction = 'both',
                    k = log(nrow(lm_train_data)))
    print(summary(best_lm))
    best_lm_predictors <- names(best_lm$coefficients)[-1]
    cat('Predictors:\n\n')
    print(best_lm_predictors)
    cat('Variance inflation factors:\n\n')
    print(vif(best_lm))
    sink()
} else {
    best_lm_predictors <- c(
        'OverallQual', 'GrLivArea', 'NeighborhoodNum', 'BsmtFinSF1',
        'GarageArea', 'OverallCond', 'YearBuilt', 'TotalBsmtSF',
        'MSZoningNum', 'Fireplaces', 'LotArea', 'SaleConditionNum',
        'BldgTypeNum', 'Condition1Num', 'BsmtExposureNum', 'KitchenQualNum',
        'FunctionalNum', 'CentralAirNum', 'NoBasement', 'SecondFlrSF',
        'NoSecondFloor', 'ScreenPorch', 'GarageCondNum', 'WoodDeckSF',
        'HeatingQCNum', 'BsmtQualNum', 'BsmtFullBath'
    )
    stepwise_formula <- as.formula(
        paste('SalePrice ~',
              paste0(best_lm_predictors, collapse = ' + '))
    )
    lm_train_data <- train_data %>%
        select(all_of(best_lm_predictors), SalePrice, Id) %>%
        filter(!(Id %in% outliers))
    best_lm <- lm(stepwise_formula, lm_train_data)
    print(summary(best_lm))
    cat('Variance inflation factors:\n\n')
    print(vif(best_lm))
}

plot(best_lm)

Call:
lm(formula = stepwise_formula, data = lm_train_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.64609 -0.05462  0.00178  0.06366  0.47698 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -1.035e+01  1.249e+00  -8.286 2.66e-16 ***
OverallQual       5.430e-02  4.078e-03  13.315  < 2e-16 ***
GrLivArea         3.466e-01  2.281e-02  15.196  < 2e-16 ***
NeighborhoodNum   1.091e-01  1.715e-02   6.363 2.66e-10 ***
BsmtFinSF1        8.000e-05  9.829e-06   8.139 8.59e-16 ***
GarageArea        1.066e-04  2.088e-05   5.103 3.79e-07 ***
OverallCond       4.948e-02  3.213e-03  15.399  < 2e-16 ***
YearBuilt         1.088e-01  1.563e-02   6.960 5.17e-12 ***
TotalBsmtSF       1.323e-04  1.622e-05   8.158 7.38e-16 ***
MSZoningNum       1.583e-01  2.179e-02   7.263 6.22e-13 ***
Fireplaces        3.132e-02  5.539e-03   5.655 1.88e-08 ***
LotArea           5.504e-02  7.354e-03   7.484 1.26e-13 ***
SaleConditionNum  1.681e-01  2.334e-02   7.204 9.46e-13 ***
BldgTypeNum       1.050e-01  3.975e-02   2.643  0.00832 ** 
Condition1Num     2.140e-01  3.672e-02   5.829 6.88e-09 ***
BsmtExposureNum   1.253e-01  2.580e-02   4.856 1.33e-06 ***
KitchenQualNum    6.469e-02  1.625e-02   3.980 7.24e-05 ***
FunctionalNum     2.468e-01  6.263e-02   3.941 8.52e-05 ***
CentralAirNum     1.230e-01  2.400e-02   5.123 3.42e-07 ***
NoBasement        1.397e-01  2.726e-02   5.123 3.42e-07 ***
SecondFlrSF       1.223e-04  2.023e-05   6.044 1.92e-09 ***
NoSecondFloor     6.629e-02  1.600e-02   4.143 3.63e-05 ***
ScreenPorch       2.065e-04  5.353e-05   3.858  0.00012 ***
GarageCondNum     9.796e-02  2.412e-02   4.061 5.14e-05 ***
WoodDeckSF        7.194e-05  2.511e-05   2.865  0.00423 ** 
HeatingQCNum      5.489e-02  1.928e-02   2.847  0.00448 ** 
BsmtQualNum       5.797e-02  1.986e-02   2.919  0.00357 ** 
BsmtFullBath      2.161e-02  7.570e-03   2.855  0.00437 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.109 on 1429 degrees of freedom
Multiple R-squared:  0.9269,    Adjusted R-squared:  0.9255 
F-statistic: 671.3 on 27 and 1429 DF,  p-value: < 2.2e-16

Variance inflation factors:

     OverallQual        GrLivArea  NeighborhoodNum       BsmtFinSF1 
        3.863665         6.965263         3.298640         2.220058 
      GarageArea      OverallCond        YearBuilt      TotalBsmtSF 
        2.408877         1.569578         5.883387         5.553875 
     MSZoningNum       Fireplaces          LotArea SaleConditionNum 
        1.610007         1.550721         1.750785         1.320391 
     BldgTypeNum    Condition1Num  BsmtExposureNum   KitchenQualNum 
        1.254595         1.076210         1.871042         2.300401 
   FunctionalNum    CentralAirNum       NoBasement      SecondFlrSF 
        1.147687         1.397871         2.255891         9.514678 
   NoSecondFloor      ScreenPorch    GarageCondNum       WoodDeckSF 
        7.703462         1.093848         1.610258         1.214552 
    HeatingQCNum      BsmtQualNum     BsmtFullBath 
        1.687849         3.473138         1.880878 

3.2 Lasso regression

Summary

Fine tune the feature selection with lasso regression, and add regularization to the multilinear model.

Discussion

Cross validation with lasso regression was used to eliminate the variables NoSecondFloor and SecondFlrSF, which significantly improved the variance inflation factors for the model.

  • The command cv.glmnet was executed, and the coefficients at lambda.1se were checked for zeros. Although there was some variation in the outcome between iterations, NoSecondFloor and SecondFlrSF were consistently set to zero by this process. The choice to eliminate SecondFlrSF was supported by the fact that it has a strong correlation with GrLivArea, probably the most important variable in the model.

After the feature selection was finalized, the cross validation was performed again to find the optimal regularization.

  • During a handful of iterations of the training process, the optimal value of lambda was consistently 0.0007.

Outcome

get_glmnet_inputs <- function(data, predictors, outliers = NULL) {
    if (!is.null(outliers)) {
        data <- data %>%
            filter(!(Id %in% outliers))
    }

    predictor_matrix <- data %>%
        select(all_of(predictors)) %>%
        data.matrix()

    if ('SalePrice' %in% colnames(data)) {
        target <- data$SalePrice
    } else {
        target <- NULL
    }

    inputs <- list(
        x = predictor_matrix,
        y = target
    )
    return(inputs)
}

glmnet_inputs <- get_glmnet_inputs(train_data, best_lm_predictors,
                                   outliers = outliers)
lasso_models <- glmnet(x = glmnet_inputs$x, y = glmnet_inputs$y,
                       alpha = 1, family = 'gaussian')
plot(lasso_models, xvar = "lambda", label = TRUE, main = "Lasso Regression")
cv_lasso_model <- cv.glmnet(x = glmnet_inputs$x, y = glmnet_inputs$y,
                            alpha = 1, family = 'gaussian',
                            nfolds = 10)
plot(cv_lasso_model, main = "Lasso Regression\n")
best_lambda <- cv_lasso_model$lambda.min

# Boolean that determines whether a test of feature selection is performed when
# the markdown file is rendered.
select_lasso_predictors <- FALSE
if (select_lasso_predictors) {
    # With lambda set to lambda.1se, two coefficients consistently go to zero.
    predict(lasso_models, s = cv_lasso_model$lambda.1se, type = 'coefficients')
} else {
    cat('Cross validation with lasso set the coefficients of SecondFlrSF and',
        'NoSecondFloor\nto zero.  The optimal value of lambda found by glmnet',
        'after these variables\nwere dropped was 0.0007.\n\n')

    best_formula <- update.formula(
        formula(best_lm),
        ~ . - NoSecondFloor - SecondFlrSF
    )
    best_lm <- lm(best_formula, lm_train_data)
    best_lm_predictors <- names(best_lm$coefficients)[-1]

    best_lambda <- 0.0007081519

    glmnet_inputs <- get_glmnet_inputs(train_data, best_lm_predictors,
                                       outliers = outliers)
    lasso_models <- glmnet(x = glmnet_inputs$x, y = glmnet_inputs$y,
                           alpha = 1, family = 'gaussian')

    print(summary(best_lm))
    cat('Variance inflation factors:\n\n')
    print(vif(best_lm))
    plot(best_lm)
}

Cross validation with lasso set the coefficients of SecondFlrSF and NoSecondFloor
to zero.  The optimal value of lambda found by glmnet after these variables
were dropped was 0.0007.


Call:
lm(formula = best_formula, data = lm_train_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.65752 -0.05643  0.00260  0.06484  0.44878 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -1.106e+01  1.249e+00  -8.856  < 2e-16 ***
OverallQual       5.374e-02  4.123e-03  13.034  < 2e-16 ***
GrLivArea         4.099e-01  1.344e-02  30.499  < 2e-16 ***
NeighborhoodNum   1.187e-01  1.728e-02   6.870 9.56e-12 ***
BsmtFinSF1        8.236e-05  9.937e-06   8.288 2.61e-16 ***
GarageArea        1.185e-04  2.094e-05   5.660 1.83e-08 ***
OverallCond       4.978e-02  3.250e-03  15.316  < 2e-16 ***
YearBuilt         1.102e-01  1.582e-02   6.966 4.96e-12 ***
TotalBsmtSF       9.964e-05  1.075e-05   9.272  < 2e-16 ***
MSZoningNum       1.574e-01  2.198e-02   7.160 1.29e-12 ***
Fireplaces        3.015e-02  5.596e-03   5.388 8.33e-08 ***
LotArea           5.600e-02  7.412e-03   7.554 7.47e-14 ***
SaleConditionNum  1.621e-01  2.359e-02   6.872 9.39e-12 ***
BldgTypeNum       1.121e-01  4.020e-02   2.789 0.005358 ** 
Condition1Num     2.144e-01  3.715e-02   5.772 9.56e-09 ***
BsmtExposureNum   1.220e-01  2.608e-02   4.676 3.20e-06 ***
KitchenQualNum    6.405e-02  1.645e-02   3.895 0.000103 ***
FunctionalNum     2.842e-01  6.261e-02   4.539 6.12e-06 ***
CentralAirNum     1.123e-01  2.419e-02   4.643 3.74e-06 ***
NoBasement        1.093e-01  2.452e-02   4.455 9.02e-06 ***
ScreenPorch       2.207e-04  5.412e-05   4.079 4.78e-05 ***
GarageCondNum     8.275e-02  2.427e-02   3.409 0.000670 ***
WoodDeckSF        7.595e-05  2.540e-05   2.991 0.002833 ** 
HeatingQCNum      6.174e-02  1.948e-02   3.170 0.001556 ** 
BsmtQualNum       6.299e-02  2.007e-02   3.139 0.001730 ** 
BsmtFullBath      2.092e-02  7.656e-03   2.733 0.006357 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1103 on 1431 degrees of freedom
Multiple R-squared:  0.9251,    Adjusted R-squared:  0.9237 
F-statistic: 706.5 on 25 and 1431 DF,  p-value: < 2.2e-16

Variance inflation factors:

     OverallQual        GrLivArea  NeighborhoodNum       BsmtFinSF1 
        3.855789         2.361504         3.270070         2.215681 
      GarageArea      OverallCond        YearBuilt      TotalBsmtSF 
        2.365536         1.567836         5.882052         2.380495 
     MSZoningNum       Fireplaces          LotArea SaleConditionNum 
        1.600096         1.545068         1.736580         1.317735 
     BldgTypeNum    Condition1Num  BsmtExposureNum   KitchenQualNum 
        1.252988         1.075697         1.866512         2.299578 
   FunctionalNum    CentralAirNum       NoBasement      ScreenPorch 
        1.120001         1.386972         1.782626         1.091664 
   GarageCondNum       WoodDeckSF     HeatingQCNum      BsmtQualNum 
        1.592409         1.213196         1.681964         3.461119 
    BsmtFullBath 
        1.878316 

3.3 Variable importance and visualization

Summary

Use the coefficients obtained with standardized input data to rank variables by importance, and visualize the selected variables.

Variable importance

scaled_lm_train_data <- as.data.frame(cbind(
    scale(glmnet_inputs$x),
    SalePrice = glmnet_inputs$y
))

# Note that executing summary(scaled_lm_train_data) is a good way to check
# whether there are large outliers in any of the selected predictors.

lm_model_scaled <- lm(best_formula, scaled_lm_train_data)
standardized_coeffs <- sort(lm_model_scaled$coefficients[-1],
                            decreasing = TRUE)

lm_predictors_by_importance <- names(standardized_coeffs)

# Curly braces used to force output into a single block in the rendered
# markdown.
{
    cat('Predictors in decreasing order of importance, along with the',
        'coefficients obtained\nusing standardized training data.\n\n')
    print(standardized_coeffs, digits = 3)
}
Predictors in decreasing order of importance, along with the coefficients obtained
using standardized training data.

       GrLivArea    FunctionalNum    Condition1Num SaleConditionNum 
        4.10e-01         2.84e-01         2.14e-01         1.62e-01 
     MSZoningNum  BsmtExposureNum  NeighborhoodNum    CentralAirNum 
        1.57e-01         1.22e-01         1.19e-01         1.12e-01 
     BldgTypeNum        YearBuilt       NoBasement    GarageCondNum 
        1.12e-01         1.10e-01         1.09e-01         8.27e-02 
  KitchenQualNum      BsmtQualNum     HeatingQCNum          LotArea 
        6.41e-02         6.30e-02         6.17e-02         5.60e-02 
     OverallQual      OverallCond       Fireplaces     BsmtFullBath 
        5.37e-02         4.98e-02         3.01e-02         2.09e-02 
     ScreenPorch       GarageArea      TotalBsmtSF       BsmtFinSF1 
        2.21e-04         1.19e-04         9.96e-05         8.24e-05 
      WoodDeckSF 
        7.59e-05 

Visualize variables

for (variable_name in lm_predictors_by_importance) {
    if (str_detect(variable_name, 'Num$')) {
        variable_name <- str_replace(variable_name, 'Num$', '')
        generate_box_plot(train_data, variable_name, show_encoding = TRUE)
    } else {
        dummy_filter <- get_dummy_filter(variable_name)
        # NoBasement is the only dummy filter that made it into the
        # multilinear model.
        if (dummy_filter != 'NoBasement') {
            dummy_filter <- ''
        }
        linear_model <- get_linear_model(variable_name, dummy_filter, outliers)
        plot_simple_fit(variable_name, linear_model)
    }
}

transformation:  area -> log(area) 

GrLivArea: Above grade (ground) living area square feet

Functional: Home functionality (Assume typical unless deductions are warranted)

       Typ      Typical Functionality
       Min1     Minor Deductions 1
       Min2     Minor Deductions 2
       Mod      Moderate Deductions
       Maj1     Major Deductions 1
       Maj2     Major Deductions 2
       Sev      Severely Damaged
       Sal      Salvage only

Condition1: Proximity to various conditions

       Artery   Adjacent to arterial street
       Feedr    Adjacent to feeder street
       Norm     Normal
       RRNn     Within 200' of North-South Railroad
       RRAn     Adjacent to North-South Railroad
       PosN     Near positive off-site feature--park, greenbelt, etc.
       PosA     Adjacent to positive off-site feature
       RRNe     Within 200' of East-West Railroad
       RRAe     Adjacent to East-West Railroad

SaleCondition: Condition of sale

       Normal   Normal Sale
       Abnorml  Abnormal Sale -  trade, foreclosure, short sale
       AdjLand  Adjoining Land Purchase
       Alloca   Allocation - two linked properties with separate deeds,
                typically condo with a garage unit
       Family   Sale between family members
       Partial  Home was not completed when last assessed
                (associated with New Homes)

MSZoning: Identifies the general zoning classification of the sale.

       A        Agriculture
       C        Commercial
       FV       Floating Village Residential
       I        Industrial
       RH       Residential High Density
       RL       Residential Low Density
       RP       Residential Low Density Park
       RM       Residential Medium Density

Neighborhood: Physical locations within Ames city limits

       Blmngtn  Bloomington Heights
       Blueste  Bluestem
       BrDale   Briardale
       BrkSide  Brookside
       ClearCr  Clear Creek
       CollgCr  College Creek
       Crawfor  Crawford
       Edwards  Edwards
       Gilbert  Gilbert
       IDOTRR   Iowa DOT and Rail Road
       MeadowV  Meadow Village
       Mitchel  Mitchell
       Names    North Ames
       NoRidge  Northridge
       NPkVill  Northpark Villa
       NridgHt  Northridge Heights
       NWAmes   Northwest Ames
       OldTown  Old Town
       SWISU    South & West of Iowa State University
       Sawyer   Sawyer
       SawyerW  Sawyer West
       Somerst  Somerset
       StoneBr  Stone Brook
       Timber   Timberland
       Veenker  Veenker

BsmtExposure: Refers to walkout or garden level walls

       Gd       Good Exposure
       Av       Average Exposure (split levels or foyers typically score average
                or above)
       Mn       Minimum Exposure
       No       No Exposure
       NB       No Basement

NoBasement: Indicator that the house has no basement

CentralAir: Central air conditioning

       N        No
       Y        Yes

transformation:  year -> (year - 1900)^3 / 1e6 

YearBuilt: Original construction date

BldgType: Type of dwelling

       1Fam     Single-family Detached
       2FmCon   Two-family Conversion; originally built as one-family dwelling
       Duplx    Duplex
       TwnhsE   Townhouse End Unit
       TwnhsI   Townhouse Inside Unit

GarageCond: Garage condition

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor
       No       No Garage

KitchenQual: Kitchen quality

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor

HeatingQC: Heating quality and condition

       Ex       Excellent
       Gd       Good
       TA       Average/Typical
       Fa       Fair
       Po       Poor

BsmtQual: Evaluates the height of the basement

       Ex       Excellent (100+ inches)
       Gd       Good (90-99 inches)
       TA       Typical (80-89 inches)
       Fa       Fair (70-79 inches)
       Po       Poor (<70 inches)
       NB       No Basement

transformation:  area -> log(area) 

LotArea: Lot size in square feet

OverallQual: Rates the overall material and finish of the house

       10       Very Excellent
       9        Excellent
       8        Very Good
       7        Good
       6        Above Average
       5        Average
       4        Below Average
       3        Fair
       2        Poor
       1        Very Poor

OverallCond: Rates the overall condition of the house

       10       Very Excellent
       9        Excellent
       8        Very Good
       7        Good
       6        Above Average
       5        Average
       4        Below Average
       3        Fair
       2        Poor
       1        Very Poor

Fireplaces: Number of fireplaces

BsmtFullBath: Basement full bathrooms

ScreenPorch: Screen porch area in square feet

GarageArea: Size of garage in square feet

TotalBsmtSF: Total square feet of basement area

BsmtFinSF1: Type 1 finished square feet

WoodDeckSF: Wood deck area in square feet

3.4 Model metrics

Summary

Compare error metrics for the training and test data.

Metrics

# During the process of target encoding, missing values in categorical
# variables are encoded with the mean of the target variable for the training
# data.
#
# Missing values in numeric variables will be imputed using means
# from the training data.  The function get_predictor_means returns a vector
# of stored means for the training data.  Note that the means are found for
# the 'raw' training data, rather than transformed training data, since missing
# values in test data will be imputed before transformations are applied.
get_predictor_means <- function() {
    data <- read.csv('data/train.csv')
    means <- numeric(length(numeric_predictors))
    names(means) <- numeric_predictors
    for (variable_name in numeric_predictors) {
        means[variable_name] <- mean(
            data[[variable_name]],
            na.rm = TRUE
        )
    }
    return(means)
}

means <- get_predictor_means()
impute_vector <- function(vec, variable_name) {
    bool_index <- is.na(vec)
    vec[bool_index] <- means[variable_name]
    return(vec)
}
impute_data <- function(data, verbose = FALSE) {
    # Impute missing numerical values by the mean from the training data.
    for (variable_name in intersect(numeric_predictors, colnames(data))) {
        missing_count <- sum(is.na(data[[variable_name]]))
        if (missing_count > 0) {
            if (verbose) {
                cat('Imputing ', missing_count, ' missing value(s) in ',
                    variable_name, '.\n', sep = '')
            }
            data[[variable_name]] <- impute_vector(data[[variable_name]],
                                                   variable_name)
        }
    }

    return(data)
}

preprocess_lm <- function(data, drop_categorical = TRUE, verbose = FALSE) {
    data <- select(data, -LotFrontage, -MasVnrArea) %>%
        mutate(MSSubClass = as.character(MSSubClass)) %>%
        define_dummies()

    # Impute missing values in numeric predictors.
    data <- impute_data(data, verbose)

    data <- apply_transformations(data)

    # Encode categorical variables.
    data <- encode_predictors(data, 'mean',
                              drop_categorical = drop_categorical)

    return(data)
}

check_vector_lengths <- function(predicted, actual) {
    num_rows <- length(predicted)
    if (num_rows != length(actual)) {
        message <- paste0(
            'In check_vector_lengths, there is a mismatch in the vector ',
            'lengths.\npredicted:  ', num_rows, '\nactual:  ', length(actual),
            '\n'
        )
        stop(message)
    }
    return(TRUE)
}

# The functions get_rmse and get_rmsle evaluate the root-mean-square error and
# the root-mean-square logarithmic error.  The argument apply_exp should be
# TRUE if the target was transformed by taking the log.
get_rmse <- function(predicted, actual, apply_exp = TRUE) {
    check_vector_lengths(predicted, actual)
    if (apply_exp) {
        predicted <- exp(predicted)
        actual <- exp(actual)
    }
    num_rows <- length(predicted)
    mse <- sum((predicted - actual)^2) / num_rows
    return(sqrt(mse))
}

get_rmsle <- function(predicted, actual, apply_exp = TRUE) {
    check_vector_lengths(predicted, actual)
    if (apply_exp) {
        predicted <- exp(predicted)
        actual <- exp(actual)
    }
    num_rows <- length(predicted)
    msle <- sum((log(predicted + 1) - log(actual + 1))^2) / num_rows
    return(sqrt(msle))
}

report_lm_metrics <- function(glmnet_inputs) {
    predicted <- as.numeric(predict(
        lasso_models,
        s = best_lambda,
        newx = glmnet_inputs$x,
        type = 'response'
    ))
    rmse <- get_rmse(predicted = predicted, actual = glmnet_inputs$y)
    rmsle <- get_rmsle(predicted = predicted, actual = glmnet_inputs$y)
    cat('root-mean-square error:  ', round(rmse, digits = 2),
        '\nroot-mean-square logarithmic error:  ', round(rmsle, digits = 3),
        '\n\n', sep = '')
}

# Boolean that determines whether a prediction for the test data is generated
# when the markdown file is rendered.
predict_linear <- FALSE
if (predict_linear) {
    test_data <- preprocess_lm(test_data)
    glmnet_inputs <- get_glmnet_inputs(test_data, best_lm_predictors)
    linear_prediction <- exp(as.numeric(predict(
        lasso_models,
        s = best_lambda,
        newx = glmnet_inputs$x,
        type = 'response'
    )))
    to_save <- data.frame(Id = test_data$Id, SalePrice = linear_prediction)
    write.csv(to_save, 'data/linear_prediction.csv',
              row.names = FALSE, quote = FALSE)
}

# Curly braces used to force output into a single block in the rendered
# markdown.
{
    cat('Error metrics for the data used to train the model ',
        '(3 outliers excluded):\n\n',
        sep = '')
    report_lm_metrics(
        get_glmnet_inputs(train_data, best_lm_predictors, outliers = outliers)
    )
    cat('Error metrics for the full "training data set" ',
        '(no outliers excluded):\n\n',
        sep = '')
    report_lm_metrics(
        get_glmnet_inputs(train_data, best_lm_predictors)
    )

    cat('\nFor the test data set, the root-mean-square logarithmic ',
        'error was 0.12963.',
        sep = '')
}

# Save predicted values and residuals to be used in comparing the multilinear
# model and the random-forest model.
get_lm_result <- function() {
    glmnet_inputs <- get_glmnet_inputs(train_data, best_lm_predictors)
    predicted <- as.numeric(predict(
        lasso_models,
        s = best_lambda,
        newx = glmnet_inputs$x,
        type = 'response'
    ))
    residuals <- train_data$SalePrice - predicted
    result <- list(
        predicted = predicted,
        residuals = residuals
    )
    return(result)
}
lm_result <- get_lm_result()
Error metrics for the data used to train the model (3 outliers excluded):

root-mean-square error:  21568.87
root-mean-square logarithmic error:  0.109

Error metrics for the full "training data set" (no outliers excluded):

root-mean-square error:  44066.56
root-mean-square logarithmic error:  0.131


For the test data set, the root-mean-square logarithmic error was 0.12963.

3.5 Possible improvements

Summary

Discuss work that could be done to improve the multilinear model.

Discussion

Feature engineering for Condition1 and Condition2:

  • These two variables both give information about “proximity to various conditions,” such as parks or feeder streets. The variable Condition1 was selected as important for the model, while Condition2 was not. The visualization of Condition1 makes clear that some conditions matter more than others, and for some houses with two conditions, the information in Condition2 may be more important than the information in Condition1. The model might be improved by defining a predictor that assigns a single condition to each house, choosing the more important one when two conditions are present.
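This idea can be sketched in a few lines of R. The helper below is hypothetical, and its priority ordering is illustrative only; a real ordering would need to be estimated from the data:

```r
# Hypothetical helper: collapse Condition1 and Condition2 into a single
# condition per house by keeping the one ranked most price-relevant.
# The ordering below is illustrative, not estimated from the data.
condition_priority <- c('PosN', 'PosA', 'RRAe', 'Artery', 'Feedr',
                        'RRAn', 'RRNe', 'RRNn', 'Norm')

combine_conditions <- function(cond1, cond2) {
    mapply(function(c1, c2) {
        # Lower rank in condition_priority means more price-relevant.
        ranks <- match(c(c1, c2), condition_priority)
        c(c1, c2)[which.min(ranks)]
    }, cond1, cond2, USE.NAMES = FALSE)
}

combine_conditions(c('Norm', 'Feedr'), c('PosN', 'Norm'))
# Returns 'PosN' for the first house (Condition2 dominates) and 'Feedr'
# for the second.
```

The resulting single condition could then be target-encoded in the same way as Condition1.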

Research to understand important variables. Examples:

  • Condition1 has four categories associated with nearness to railroads. These distinguish between north-south and east-west railroads, as well as between houses that are adjacent to a railroad and houses within 200 ft of one. Of these four categories, the only one that clearly shows a price penalty is ‘adjacent to east-west railroad.’ Is there an east-west railroad that is particularly active? Or is this just a random effect associated with the relatively small number of houses in these categories?
  • Neighborhood has a relative importance of about 30%. What are the underlying reasons that some neighborhoods are preferred over others?

Better handling of outliers:

  • Even with three outliers excluded, the training data has a significant number of observations with large residuals, and these may be distorting the model predictions. Using a loss function such as the Huber loss that is robust to large residuals might give a better model by minimizing such distortions.
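A minimal sketch of the robust-loss idea, using Huber M-estimation via MASS::rlm on invented toy data rather than the actual training set:

```r
library(MASS)

# Toy data (invented for illustration): log price depends linearly on
# log area, with one artificial outlier added at the end.
set.seed(42)
log_area <- log(runif(40, 800, 2500))
log_price <- 11 + 0.9 * (log_area - 7) + rnorm(40, sd = 0.05)
log_price[40] <- log_price[40] + 2  # artificial outlier
demo <- data.frame(log_area, log_price)

# Ordinary least squares bends the fit toward the outlier; the Huber
# M-estimator down-weights large residuals instead of discarding them.
ols_fit <- lm(log_price ~ log_area, data = demo)
huber_fit <- rlm(log_price ~ log_area, data = demo, psi = psi.huber)

# Under the Huber fit, the outlier absorbs less of the fit, so its
# residual stays larger than under OLS.
```

The same idea could be applied to the full model, letting all observations stay in the training data while limiting the influence of the large residuals.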

Better handling of dummy variables defined during feature engineering:

  • Stepwise regression tests the improvement associated with adding or subtracting individual variables. This process ignores the way in which combinations of two variables improve the model, e.g., NoGarage and GarageArea. Dummy variables such as NoGarage were defined specifically to improve the fit for related numerical variables, but the stepwise regression used for feature selection was “unaware” of this aspect of the dummy variables. This might be corrected by exploring different starting models for the stepwise regression.
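One way to sketch this, on synthetic data with hypothetical names, is to start the search performed by stats::step from a model that already contains an engineered dummy together with its related numeric variable, so that additions are judged relative to the pair rather than to an empty model:

```r
# Synthetic data (invented for illustration): houses with no garage have
# garage_area = 0 and a small price penalty captured by the dummy.
set.seed(1)
n <- 200
garage_area <- ifelse(runif(n) < 0.1, 0, runif(n, 200, 900))
no_garage <- as.integer(garage_area == 0)
gr_liv_area <- runif(n, 800, 2500)
log_price <- 10 + 4e-4 * gr_liv_area + 3e-4 * garage_area -
    0.05 * no_garage + rnorm(n, sd = 0.1)
demo <- data.frame(garage_area, no_garage, gr_liv_area, log_price)

# Start from the paired garage variables rather than an empty model, so
# the stepwise search evaluates other additions against the dummy/numeric
# pair from the outset.
start_model <- lm(log_price ~ garage_area + no_garage, data = demo)
stepped <- step(start_model,
                scope = log_price ~ garage_area + no_garage + gr_liv_area,
                direction = 'both', trace = 0)
```

With a strong living-area signal, the search adds gr_liv_area while the paired garage variables were considered together from the first step.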

3.6 Qualitative model

Summary

Create a simple qualitative summary of the model.

Discussion

The multilinear model includes 25 predictors, some of which are much less important than others, and so the “take-home messages” from the model may not be apparent.

  • Example: With standardized inputs, the coefficient for WoodDeckSF is smaller than the coefficient for GrLivArea by a factor of about 5000.

Extract the most important features from the model, and group these features into three levels by importance.

  • In reporting the results, suppress the technical distinction between the original variables and the transformed or encoded variables used in the multilinear model.
  • Note that the training process includes a random element, and so the relative importance attached to each variable changes somewhat with each iteration of the training. As a result, the values shown in this subsection differ somewhat from the hard-coded values shown in Sec 1.1.

Primary variables

coef_relative_importance <- standardized_coeffs / standardized_coeffs[1]
extract_by_importance <- function(lower, upper) {
    bool_index <- (
        (coef_relative_importance <= upper) & (coef_relative_importance > lower)
    )
    extracted <- coef_relative_importance[bool_index]
    names(extracted) <- names(extracted) %>%
        str_replace('Num', '')
    return(extracted)
}
first_importance <- extract_by_importance(0.35, 1.0)
second_importance <- extract_by_importance(0.2, 0.35)
third_importance <- extract_by_importance(0.1, 0.2)


report_summary(first_importance)
GrLivArea: Above grade (ground) living area square feet

Relative importance: 100%
Comment: The log of sale price varies linearly with the log of living area.


Functional: Home functionality (Assume typical unless deductions are warranted)

       Typ      Typical Functionality
       Min1     Minor Deductions 1
       Min2     Minor Deductions 2
       Mod      Moderate Deductions
       Maj1     Major Deductions 1
       Maj2     Major Deductions 2
       Sev      Severely Damaged
       Sal      Salvage only

Relative importance: 69.3%
Comment: Any drop below typical functionality causes a drop in price, but
    the data does not show a clear dependence of price on the level
    of functionality loss.


Condition1: Proximity to various conditions

       Artery   Adjacent to arterial street
       Feedr    Adjacent to feeder street
       Norm     Normal
       RRNn     Within 200' of North-South Railroad
       RRAn     Adjacent to North-South Railroad
       PosN     Near positive off-site feature--park, greenbelt, etc.
       PosA     Adjacent to positive off-site feature
       RRNe     Within 200' of East-West Railroad
       RRAe     Adjacent to East-West Railroad

Relative importance: 52.3%
Comment: Being near or adjacent to a park, greenbelt, etc. gives a boost
    in price, while being adjacent to a feeder street or arterial street
    gives a drop in price.  Research is needed to clarify the way
    in which proximity to railroads affects price.


SaleCondition: Condition of sale

       Normal   Normal Sale
       Abnorml  Abnormal Sale -  trade, foreclosure, short sale
       AdjLand  Adjoining Land Purchase
       Alloca   Allocation - two linked properties with separate deeds,
                typically condo with a garage unit
       Family   Sale between family members
       Partial  Home was not completed when last assessed
                (associated with New Homes)

Relative importance: 39.6%
Comment: Houses with an abnormal sale take a price penalty, while new homes
    in the Partial category have a price boost.


MSZoning: Identifies the general zoning classification of the sale.

       A        Agriculture
       C        Commercial
       FV       Floating Village Residential
       I        Industrial
       RH       Residential High Density
       RL       Residential Low Density
       RP       Residential Low Density Park
       RM       Residential Medium Density

Relative importance: 38.4%
Comment: Most houses are in a low-density residential zone.  Houses in a
    "Floating Village" residential zone get a boost in price, while
    houses in medium- or high-density residential zones get a price
    penalty.  Houses in commercial zones get a substantial price penalty.

Secondary variables

report_summary(second_importance)
BsmtExposure: Refers to walkout or garden level walls

       Gd       Good Exposure
       Av       Average Exposure (split levels or foyers typically score average
                or above)
       Mn       Minimum Exposure
       No       No Exposure
       NB       No Basement

Relative importance: 29.8%


Neighborhood: Physical locations within Ames city limits

       Blmngtn  Bloomington Heights
       Blueste  Bluestem
       BrDale   Briardale
       BrkSide  Brookside
       ClearCr  Clear Creek
       CollgCr  College Creek
       Crawfor  Crawford
       Edwards  Edwards
       Gilbert  Gilbert
       IDOTRR   Iowa DOT and Rail Road
       MeadowV  Meadow Village
       Mitchel  Mitchell
       Names    North Ames
       NoRidge  Northridge
       NPkVill  Northpark Villa
       NridgHt  Northridge Heights
       NWAmes   Northwest Ames
       OldTown  Old Town
       SWISU    South & West of Iowa State University
       Sawyer   Sawyer
       SawyerW  Sawyer West
       Somerst  Somerset
       StoneBr  Stone Brook
       Timber   Timberland
       Veenker  Veenker

Relative importance: 29%


CentralAir: Central air conditioning

       N        No
       Y        Yes

Relative importance: 27.4%


BldgType: Type of dwelling

       1Fam     Single-family Detached
       2FmCon   Two-family Conversion; originally built as one-family dwelling
       Duplx    Duplex
       TwnhsE   Townhouse End Unit
       TwnhsI   Townhouse Inside Unit

Relative importance: 27.3%


YearBuilt: Original construction date

Relative importance: 26.9%


NoBasement:  Does the house have a basement?

Relative importance: 26.7%


GarageCond: Garage condition

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor
       No       No Garage

Relative importance: 20.2%

Minor variables

report_summary(third_importance)
KitchenQual: Kitchen quality

       Ex       Excellent
       Gd       Good
       TA       Typical/Average
       Fa       Fair
       Po       Poor

Relative importance: 15.6%


BsmtQual: Evaluates the height of the basement

       Ex       Excellent (100+ inches)
       Gd       Good (90-99 inches)
       TA       Typical (80-89 inches)
       Fa       Fair (70-79 inches)
       Po       Poor (<70 inches)
       NB       No Basement

Relative importance: 15.4%


HeatingQC: Heating quality and condition

       Ex       Excellent
       Gd       Good
       TA       Average/Typical
       Fa       Fair
       Po       Poor

Relative importance: 15.1%


LotArea: Lot size in square feet

Relative importance: 13.7%


OverallQual: Rates the overall material and finish of the house

       10       Very Excellent
       9        Excellent
       8        Very Good
       7        Good
       6        Above Average
       5        Average
       4        Below Average
       3        Fair
       2        Poor
       1        Very Poor

Relative importance: 13.1%


OverallCond: Rates the overall condition of the house

       10       Very Excellent
       9        Excellent
       8        Very Good
       7        Good
       6        Above Average
       5        Average
       4        Below Average
       3        Fair
       2        Poor
       1        Very Poor

Relative importance: 12.1%

4 Random-forest model

Test whether a random-forest model can improve accuracy.

4.1 Feature selection

Summary

Define two feature sets for which hyperparameters will be tuned.

Discussion

For the multilinear model, stepwise regression was used to select important variables, and highly correlated predictors were eliminated with lasso regression. Can a similar approach be used for the random-forest model?

A process inspired by stepwise regression was used to select a feature set for the random-forest model, with the least important variable eliminated during each successive iteration.

A natural next step would be to remove highly correlated features from the feature set. As a simple alternative to developing an algorithm to do this, the features obtained for the multilinear model were tested in a random-forest model.

  • The predictor ‘NoBasement’ was not included in the random-forest model. The reason is that ‘NoBasement’ was added specifically in order to decrease distortions in a linear fit, an issue that is not significant for a random forest.

Method

Prepare data:

  • Categorical variables were encoded using the method developed for the multilinear model.

  • The variables LotFrontage and MasVnrArea were dropped, since these variables each contain a relatively large number of missing values.

  • A few numeric variables in the test data had one or two missing values, and these were imputed using the mean.

Before beginning the stepwise process of feature selection, determine how large ntree needs to be to give stable results:

  • Both root-mean-square logarithmic error (RMSLE) and feature importance (evaluated by permuting predictor variables) should be stable.
  • A loop over values of ntree was performed using a model that included all predictors remaining after data preparation. The choice of least-important feature was the same for all values of ntree \(\geq 5000\). Changes in RMSLE as ntree was increased above 5000 were negligible.

Starting from a random-forest model that includes the 77 predictors remaining after data preparation, eliminate predictors in a stepwise process:

  • At each step, find the optimal value of mtry for a random forest that includes all of the remaining predictors. (In practice, the optimal value from a set of 11 tested values was found.)
  • Find the RMSLE for the model with the optimal value of mtry. Note that the predictions used to calculate RMSLE are based on out-of-bag samples.
  • Sort the predictors based on variable importance and drop the least important predictor.

A plot of RMSLE vs. the number of features showed a nearly linear pattern of decreasing error as the number of features dropped from 77 to 38. The set of 38 features that gave the minimum error was selected for the model.

  • The drop in RMSLE due to the process of feature selection was only 1.8%.

Outcome

# A fresh copy of the data will be used for the random-forest model.  In
# preparation for redoing the target encoding, source 'encode.R' again to
# reset variable_encodings to an empty list.
source('encode.R')
read_tree_data <- function(file) {
    data <- read.csv(file, stringsAsFactors = FALSE,
                     colClasses = c(MSSubClass = 'character'))
    return(data)
}

preprocess_tree <- function(data, encoding_method = 'mean',
                            outliers = NULL, drop_categorical = TRUE,
                            verbose = FALSE) {
    data <- select(data, -LotFrontage, -MasVnrArea)

    if (!is.null(outliers)) {
        data <- data %>%
            filter(!(Id %in% outliers))
    }

    # The target encoding that was explored for the multilinear model used
    # log(SalePrice), with the mean of log(SalePrice) given a weight for
    # classes with a small number of samples.  In order to use the same
    # encoding method, apply the log to SalePrice if it is present.
    if ('SalePrice' %in% colnames(data)) {
        data <- mutate(data, SalePrice = log(SalePrice))
    }

    # Impute missing values in numeric predictors.
    data <- impute_data(data, verbose)

    # Encode categorical variables.  Note that the encodings assigned are
    # based on log(SalePrice), although no transformations are applied to the
    # variables of the tree model.
    data <- encode_predictors(data, method = encoding_method,
                              drop_categorical = drop_categorical)

    return(data)
}

train_data <- preprocess_tree(read_tree_data('data/train.csv'))
test_data <- preprocess_tree(read_tree_data('data/test.csv'))

all_tree_predictors <- setdiff(colnames(test_data), 'Id')

rf_features_from_lm <- setdiff(best_lm_predictors, 'NoBasement')

# Function to generate a random forest model using the training data.
run_rf <- function(predictors, ...) {
    x_data <- select(train_data, all_of(predictors))
    y_data <- train_data$SalePrice
    rf_model <- randomForest(x = x_data, y = y_data, importance = TRUE, ...)
    return(rf_model)
}

# Get a vector ordered by decreasing importance, where importance is evaluated
# by permuting predictor variables.
get_importance <- function(model) {
    importance_vec <- sort(
        importance(model)[, '%IncMSE'],
        decreasing = TRUE
    )
    return(importance_vec)
}

get_rf_error <- function(model) {
    rmsle <- get_rmsle(model$predicted, train_data$SalePrice)
    return(rmsle)
}

# For the random-forest model that includes all tree predictors, characterize
# the number of trees needed, based on changes in RMSLE and variable
# importance.
check_ntree <- FALSE
if (check_ntree) {
    ntree_vec <- c(500, seq(5000, 50000, by = 5000))
    ntree_results <- list()
    for (index in seq_along(ntree_vec)) {
        ntree <- ntree_vec[index]
        model <- run_rf(all_tree_predictors, ntree = ntree)
        importance_vec <- get_importance(model)
        error <- get_rf_error(model)

        ntree_results[[index]] <- list(
            ntree = ntree,
            importance_vec = importance_vec,
            error = error
        )

        filename <- paste0('importance_', ntree, '.txt')
        sink(filename)
        cat('RMSLE is:  ', error, '\n\nImportance:\n\n', sep = '')
        print(importance_vec)
        sink()
    }
}

# For all values of ntree >= 5000, the least important variable was the same.
# Improvements in RMSLE for larger values of ntree were insignificant.
ntree <- 5000

# Encapsulate an individual step of eliminating the least important variable
# into a function.
drop_rf_feature <- function(features) {
    number_of_features <- length(features)
    cat('\nPerforming cross validation for a set of ',
        number_of_features,
        ' features.\n\n',
        sep = '')

    errors <- as.numeric(rep(1e6, number_of_features))
    importance_vectors <- list()

    default_mtry <- as.integer(number_of_features / 3)
    mtry_start <- max(2, default_mtry - 5)
    mtry_end <- min(number_of_features, default_mtry + 5)
    mtry_range <- seq(mtry_start, mtry_end)
    for (mtry in mtry_range) {
        cat('Testing mtry = ', mtry, '.\n', sep = '')
        model <- run_rf(features, ntree = ntree, mtry = mtry)
        errors[mtry] <- get_rf_error(model)
        importance_vectors[[mtry]] <- get_importance(model)
        cat('Error = ', errors[mtry], '.\n\n', sep = '')
    }

    plot(mtry_range, errors[mtry_range], type = 'l')

    index <- which.min(errors)
    error <- errors[index]
    importance_vec <- importance_vectors[[index]]

    mtry_index <- which.min(errors[mtry_range])
    cat('Selected mtry:  ', mtry_range[mtry_index],
        '.\n\nImportance vector:\n\n', sep = '')
    print(importance_vec)
    dropped_feature <- tail(names(importance_vec), 1)
    cat('\nDropped feature:  ', dropped_feature, '.\n\n', sep = '')
    features <- setdiff(features, dropped_feature)
    cat('Remaining features:\n\n')
    print(features)
    return(list(error = error, features = features))
}

select_features_rf <- FALSE
if (select_features_rf) {
    # Create the variables that will be updated in a loop.
    remaining_features <- all_tree_predictors
    number_of_features <- length(remaining_features)
    feature_set_errors <- as.numeric(rep(NA, number_of_features))
    feature_sets <- list()
    feature_sets[[number_of_features]] <- all_tree_predictors

    sink('select_rf_features.log', append = TRUE)
    number_of_iterations <- 65
    for (iteration in seq(1, number_of_iterations)) {
        cv_result <- drop_rf_feature(remaining_features)
        remaining_features <- cv_result$features
        number_of_features <- length(remaining_features)
        feature_set_errors[number_of_features + 1] <- cv_result$error
        feature_sets[[number_of_features]] <- remaining_features
    }
    selected_index <- which.min(feature_set_errors)
    rf_features_from_importance <- feature_sets[[selected_index]]
    cat('\nAfter 65 iterations, the feature set with the lowest root-mean-',
        'square logarithmic error\nusing out-of-bag samples for prediction ',
        'is the following:\n\n', sep = '')
    print(rf_features_from_importance)
    sink()
} else {
    feature_set_errors <- c(
        NA, NA, NA, NA, NA, NA,
        NA, NA, NA, NA, NA, NA,
        0.1344358, 0.1327722, 0.1333226, 0.1336695, 0.1337583, 0.1336552,
        0.1338953, 0.1339020, 0.1339117, 0.1343086, 0.1338769, 0.1340653,
        0.1341761, 0.1339882, 0.1333068, 0.1335836, 0.1332799, 0.1334367,
        0.1334917, 0.1335227, 0.1338326, 0.1338394, 0.1330482, 0.1335071,
        0.1325262, 0.1324736, 0.1326780, 0.1327063, 0.1327826, 0.1330062,
        0.1329979, 0.1330804, 0.1333393, 0.1332839, 0.1333281, 0.1332639,
        0.1335952, 0.1333970, 0.1334968, 0.1334317, 0.1332154, 0.1335923,
        0.1334465, 0.1335458, 0.1337589, 0.1336664, 0.1337066, 0.1338819,
        0.1338450, 0.1339195, 0.1337900, 0.1341650, 0.1341732, 0.1341367,
        0.1340504, 0.1343846, 0.1343936, 0.1345957, 0.1345700, 0.1346031,
        0.1345559, 0.1347084, 0.1345209, 0.1348670, 0.1348595
    )
    rf_features_from_importance <- c(
        'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
        'YearRemodAdd', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF',
        'FirstFlrSF', 'SecondFlrSF', 'GrLivArea', 'BsmtFullBath',
        'FullBath', 'HalfBath', 'BedroomAbvGr', 'TotRmsAbvGrd',
        'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea',
        'OpenPorchSF', 'MSSubClassNum', 'MSZoningNum', 'NeighborhoodNum',
        'HouseStyleNum', 'Exterior1stNum', 'Exterior2ndNum', 'ExterQualNum',
        'BsmtQualNum', 'BsmtCondNum', 'BsmtExposureNum', 'BsmtFinType1Num',
        'HeatingQCNum', 'CentralAirNum', 'KitchenQualNum', 'FireplaceQuNum',
        'GarageTypeNum', 'GarageFinishNum'
    )
}

plot(feature_set_errors,
     col = 'navyblue',
     pch = 19,
     ylab = 'Error',
     xlab = 'Number of features',
     main = paste0('Root-mean-square logarithmic error\nduring stepwise ',
                   'feature selection'))
{
    cat('After 65 iterations, the feature set with the lowest root-mean-',
        'square logarithmic\nerror using out-of-bag samples for prediction ',
        'is the following:\n\n', sep = '')
    print(rf_features_from_importance)
    cat('\n\nA second feature set that will be tested is obtained from ',
        'the multilinear model,\nwith the dummy variable NoBasement excluded:',
        '\n\n',
        sep = '')
    print(rf_features_from_lm)
}

After 65 iterations, the feature set with the lowest root-mean-square logarithmic
error using out-of-bag samples for prediction is the following:

 [1] "LotArea"         "OverallQual"     "OverallCond"     "YearBuilt"      
 [5] "YearRemodAdd"    "BsmtFinSF1"      "BsmtUnfSF"       "TotalBsmtSF"    
 [9] "FirstFlrSF"      "SecondFlrSF"     "GrLivArea"       "BsmtFullBath"   
[13] "FullBath"        "HalfBath"        "BedroomAbvGr"    "TotRmsAbvGrd"   
[17] "Fireplaces"      "GarageYrBlt"     "GarageCars"      "GarageArea"     
[21] "OpenPorchSF"     "MSSubClassNum"   "MSZoningNum"     "NeighborhoodNum"
[25] "HouseStyleNum"   "Exterior1stNum"  "Exterior2ndNum"  "ExterQualNum"   
[29] "BsmtQualNum"     "BsmtCondNum"     "BsmtExposureNum" "BsmtFinType1Num"
[33] "HeatingQCNum"    "CentralAirNum"   "KitchenQualNum"  "FireplaceQuNum" 
[37] "GarageTypeNum"   "GarageFinishNum"


A second feature set that will be tested is obtained from the multilinear model,
with the dummy variable NoBasement excluded:

 [1] "OverallQual"      "GrLivArea"        "NeighborhoodNum"  "BsmtFinSF1"      
 [5] "GarageArea"       "OverallCond"      "YearBuilt"        "TotalBsmtSF"     
 [9] "MSZoningNum"      "Fireplaces"       "LotArea"          "SaleConditionNum"
[13] "BldgTypeNum"      "Condition1Num"    "BsmtExposureNum"  "KitchenQualNum"  
[17] "FunctionalNum"    "CentralAirNum"    "ScreenPorch"      "GarageCondNum"   
[21] "WoodDeckSF"       "HeatingQCNum"     "BsmtQualNum"      "BsmtFullBath"    

4.2 Tune the model

Summary

Perform a grid search to tune the hyperparameters mtry and nodesize for each of the two feature sets.

Outcome

tune_rf_model <- function(predictors, nodesizes = NULL) {
    number_of_features <- length(predictors)
    default_mtry <- as.integer(number_of_features / 3)
    mtry_range <- seq(default_mtry - 5, default_mtry + 5)
    if (is.null(nodesizes)) {
        nodesizes <- c(3, 5, 10)
    }

    errors <- matrix(nrow = length(mtry_range),
                     ncol = length(nodesizes))
    rownames(errors) <- mtry_range
    colnames(errors) <- nodesizes

    for (nodesize_index in seq_along(nodesizes)) {
        nodesize <- nodesizes[nodesize_index]
        cat('\nTesting nodesize = ', nodesize, '.\n\n', sep = '')

        for (mtry_index in seq_along(mtry_range)) {
            mtry <- mtry_range[mtry_index]
            cat('Testing mtry = ', mtry, '.\n', sep = '')

            model <- run_rf(predictors, ntree = ntree, mtry = mtry,
                            nodesize = nodesize)
            rmsle <- get_rf_error(model)
            errors[mtry_index, nodesize_index] <- rmsle
            cat('Error = ', rmsle, '.\n\n', sep = '')
        }
        plot(mtry_range,
             errors[, nodesize_index],
             type = 'l',
             col = 'navyblue',
             xlab = 'mtry',
             ylab = 'Error',
             main = paste0('Root-mean-square logarithmic error,\n',
                           'nodesize = ',
                           nodesize))
    }
    result <- list(errors = errors,
                   nodesizes = nodesizes,
                   mtry_range = mtry_range)
    return(result)
}

tune_rf <- FALSE
if (tune_rf) {
    sink('tune_rf_model.log', append = TRUE)
    result <- tune_rf_model(rf_features_from_importance)
    errors <- result$errors
    min_indices <- which(errors == min(errors), arr.ind = TRUE)
    optimum_mtry <- result$mtry_range[min_indices[1]]
    optimum_nodesize <- result$nodesizes[min_indices[2]]
    min_error <- errors[min_indices]
    cat('\n\nThe optimized choices are:\n\nmtry = ', optimum_mtry,
        '\nnodesize = ', optimum_nodesize,
        '\n\nThe minimum RMSLE is: ', min_error,
        sep = '')
    sink()
}

check_lm_features <- FALSE
if (check_lm_features) {
    sink('tune_rf_with_lm_features.log', append = TRUE)
    result <- tune_rf_model(rf_features_from_lm, nodesizes = c(2, 3, 4))
    errors <- result$errors
    min_indices <- which(errors == min(errors), arr.ind = TRUE)
    optimum_mtry <- result$mtry_range[min_indices[1]]
    optimum_nodesize <- result$nodesizes[min_indices[2]]
    min_error <- errors[min_indices]
    cat('\n\nThe optimized choices are:\n\nmtry = ', optimum_mtry,
        '\nnodesize = ', optimum_nodesize,
        '\n\nThe minimum RMSLE is: ', min_error,
        sep = '')
    sink()
}

mtry_range <- seq(3, 13)
errors <- c(
    0.1355024, 0.1332273, 0.1317252, 0.1310732, 0.1307835, 0.1307519,
    0.1307294, 0.1303224, 0.1307371, 0.1308829, 0.1313430
)
plot(mtry_range, errors,
     type = 'l', col = 'navyblue',
     xlab = 'mtry', ylab = 'Error',
     main = 'Root-mean-square logarithmic error,\nnodesize = 3')
points(mtry_range, errors,
       pch = 19, col = 'navyblue')
{
    cat('The lowest RMSLE was obtained with mtry = 10, nodesize = 3 using the',
        'set\nof features obtained from the multilinear model:\n\n')
    print(rf_features_from_lm)
}

The lowest RMSLE was obtained with mtry = 10, nodesize = 3 using the set
of features obtained from the multilinear model:

 [1] "OverallQual"      "GrLivArea"        "NeighborhoodNum"  "BsmtFinSF1"      
 [5] "GarageArea"       "OverallCond"      "YearBuilt"        "TotalBsmtSF"     
 [9] "MSZoningNum"      "Fireplaces"       "LotArea"          "SaleConditionNum"
[13] "BldgTypeNum"      "Condition1Num"    "BsmtExposureNum"  "KitchenQualNum"  
[17] "FunctionalNum"    "CentralAirNum"    "ScreenPorch"      "GarageCondNum"   
[21] "WoodDeckSF"       "HeatingQCNum"     "BsmtQualNum"      "BsmtFullBath"    

4.3 Evaluate the model

Summary

Examine model metrics, plots of residuals, and variable importance.

Discussion

For the test data, the random-forest model had RMSLE 4.4% larger than the multilinear model.

  • Given the failure to improve on the accuracy of the multilinear model, it is natural to ask whether the feature engineering and feature selection for the random-forest model need to be improved.
  • In essence, the multilinear model did better with a feature set selected for it than the random-forest model did with almost the same feature set.
  • However, it is also striking that for all feature sets and hyperparameters tested, the RMSLE of the out-of-bag prediction was similar.

Residuals:

  • Most residuals are between -0.2 and 0.2, as in the multilinear model.
  • Plotting residuals vs fitted values for the random-forest and multilinear models clearly shows that the random-forest model does a much better job of handling outliers.

Although the two models have similar accuracy, the relative importance of the features is dramatically different in the models.

  • For instance, in the multilinear model with standardized inputs, TotalBsmtSF has a coefficient smaller than that of GrLivArea by a factor of about 4000. In the random-forest model, the importance score for TotalBsmtSF is about 64% as large as for GrLivArea.

Since the methods of evaluating importance for the two models are different, we don’t really have an “apples to apples” comparison. However, it is still interesting that the ordering of variable importance is dramatically different for the two models.

Model metrics

rf_model_filename <- 'tuned_rf_model.Rds'
if (!file.exists(rf_model_filename)) {
    cat('Generating random-forest model.\n')
    tuned_rf_model <- run_rf(rf_features_from_lm, ntree = ntree,
                             mtry = 10, nodesize = 3)
    saveRDS(tuned_rf_model, rf_model_filename)

    predict_rf <- FALSE
    if (predict_rf) {
        x_data <- select(test_data, all_of(rf_features_from_lm))
        rf_prediction <- exp(predict(tuned_rf_model,
                                     newdata = x_data,
                                     type = 'response'))
        to_save <- data.frame(Id = test_data$Id, SalePrice = rf_prediction)
        write.csv(to_save, 'data/rf_prediction.csv',
                  row.names = FALSE, quote = FALSE)
    }
} else {
    tuned_rf_model <- readRDS(rf_model_filename)
}

report_rf_metrics <- function(model) {
    rmsle <- get_rmsle(model$predicted, train_data$SalePrice)
    rmse <- get_rmse(model$predicted, train_data$SalePrice)
    cat('root-mean-square error:  ', round(rmse, digits = 2),
        '\nroot-mean-square logarithmic error:  ', round(rmsle, digits = 3),
        '\n\n', sep = '')
}
{
    cat('Error metrics for the training data (using out-of-bag samples for ',
        'prediction):\n\n',
        sep = '')
    report_rf_metrics(tuned_rf_model)

    cat('\nFor the test data set, the root-mean-square logarithmic ',
        'error (RMSLE) was 0.13533.\n',
        sep = '')

    rf_error <- 0.13533
    lm_error <- 0.12963
    cat('\nThe random-forest model had an RMSLE ',
        round((rf_error - lm_error) / lm_error * 100, digits = 1),
        '% larger than the multilinear model.',
        sep = '')

}
Error metrics for the training data (using out-of-bag samples for prediction):

root-mean-square error:  27657.38
root-mean-square logarithmic error:  0.132


For the test data set, the root-mean-square logarithmic error (RMSLE) was 0.13533.

The random-forest model had an RMSLE 4.4% larger than the multilinear model.

Residuals

rf_residuals <- train_data$SalePrice - tuned_rf_model$predicted
plot(tuned_rf_model$predicted,
     rf_residuals,
     col = 'navyblue',
     ylab = 'Residual',
     xlab = 'Fitted value',
     xlim = c(10.4, 14.6),
     main = 'Residuals of the tuned random-forest model')

# Compare to the residuals for the multilinear model, with outliers included.
plot(lm_result$predicted,
     lm_result$residuals,
     col = 'navyblue',
     ylab = 'Residual',
     xlab = 'Fitted value',
     xlim = c(10.4, 14.6),
     main = paste0('Residuals of the multilinear model\n',
                   '(including outliers excluded during training)'))

Variable importance

compare_importance <- function() {
    cat('Relative importance for the variables of the multilinear model:\n\n')
    print(coef_relative_importance)
    rf_importance <- get_importance(tuned_rf_model)
    rf_importance <- rf_importance / rf_importance[1]
    cat('\n\nRelative importance for the variables of the random-forest',
        'model:\n\n')
    print(rf_importance)
}
compare_importance()
Relative importance for the variables of the multilinear model:

       GrLivArea    FunctionalNum    Condition1Num SaleConditionNum 
    1.0000000000     0.6933092624     0.5231039284     0.3955126450 
     MSZoningNum  BsmtExposureNum  NeighborhoodNum    CentralAirNum 
    0.3839750470     0.2975020342     0.2896127678     0.2740503846 
     BldgTypeNum        YearBuilt       NoBasement    GarageCondNum 
    0.2734781468     0.2687625119     0.2665532587     0.2018605350 
  KitchenQualNum      BsmtQualNum     HeatingQCNum          LotArea 
    0.1562551331     0.1536517470     0.1506191967     0.1366009237 
     OverallQual      OverallCond       Fireplaces     BsmtFullBath 
    0.1311008473     0.1214350559     0.0735453148     0.0510394262 
     ScreenPorch       GarageArea      TotalBsmtSF       BsmtFinSF1 
    0.0005384634     0.0002891327     0.0002430738     0.0002009218 
      WoodDeckSF 
    0.0001852784 


Relative importance for the variables of the random-forest model:

       GrLivArea  NeighborhoodNum      TotalBsmtSF      OverallQual 
      1.00000000       0.66393635       0.64073969       0.58242229 
      BsmtFinSF1       GarageArea          LotArea      OverallCond 
      0.50910071       0.43868406       0.40724446       0.36917490 
       YearBuilt   KitchenQualNum       Fireplaces      BsmtQualNum 
      0.32621718       0.27154467       0.27123826       0.20746129 
     MSZoningNum     BsmtFullBath  BsmtExposureNum     HeatingQCNum 
      0.20278529       0.18682592       0.17854724       0.17542339 
   CentralAirNum       WoodDeckSF      BldgTypeNum      ScreenPorch 
      0.13642821       0.13604529       0.10319562       0.06596549 
   GarageCondNum SaleConditionNum    FunctionalNum    Condition1Num 
      0.06121935       0.04785998       0.04088203       0.03272141 

4.4 Possible improvements

Summary

Discuss ways in which the accuracy of the tree-based model could be improved.

Discussion

Test gradient boosting as an alternative to the random-forest algorithm.

Extend the method of feature selection based on feature importance to include an algorithm for eliminating highly correlated predictors.
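
A minimal base-R sketch of such an elimination step, greedy in the spirit of caret::findCorrelation (the function name and cutoff below are illustrative, not part of the report's code):

```r
# Greedily drop predictors until no pair of the remaining numeric
# predictors has absolute correlation above the cutoff.
drop_correlated <- function(num_data, cutoff = 0.9) {
    cmat <- abs(cor(num_data))
    diag(cmat) <- 0
    keep <- colnames(num_data)
    while (max(cmat[keep, keep]) > cutoff) {
        sub <- cmat[keep, keep]
        # remove the predictor with the largest remaining correlation
        worst <- keep[which.max(apply(sub, 1, max))]
        keep <- setdiff(keep, worst)
    }
    keep
}
```

The surviving column names could then replace rf_features_from_lm when refitting the model.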

Modify the training and prediction methods to include an average over random elements of the \(k\)-fold target encoding.

  • The random elements can include both the selection of folds and the selection of encodings assigned to test data.
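
A sketch of the underlying k-fold encoding step (the helper name is hypothetical; re-running it with freshly drawn folds and averaging the encodings implements the randomized average described above):

```r
# Hypothetical helper: k-fold target encoding in which each row receives
# the mean target over out-of-fold rows in the same category.
kfold_target_encode <- function(category, target, k = 5) {
    folds <- sample(rep_len(seq_len(k), length(target)))
    encoding <- numeric(length(target))
    for (f in seq_len(k)) {
        in_fold <- folds == f
        fold_means <- tapply(target[!in_fold], category[!in_fold], mean)
        encoding[in_fold] <- fold_means[as.character(category[in_fold])]
    }
    encoding
}
```

Because each row is encoded without its own target value, this limits the target leakage that a plain category-mean encoding would introduce.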

Check whether accuracy can be improved by using unencoded categorical variables.

Test imputation methods for missing data.
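
The gradient-boosting alternative mentioned above would normally be tested with a package such as gbm or xgboost. As a self-contained illustration of the idea, here is a base-R sketch that boosts depth-one stumps on squared loss (all names and tuning values are illustrative, and x is assumed to have at least two distinct values):

```r
# Minimal gradient boosting on squared loss with stump base learners:
# each round fits the best single split to the current residuals and
# adds a shrunken copy of that fit to the running prediction.
boost_stumps <- function(x, y, n_rounds = 50, shrinkage = 0.1) {
    pred <- rep(mean(y), length(y))
    for (i in seq_len(n_rounds)) {
        resid <- y - pred            # negative gradient of squared loss
        best <- list(sse = Inf)
        for (s in unique(x)) {       # search all single-split stumps
            left <- x <= s
            if (all(left) || !any(left)) next
            fit <- ifelse(left, mean(resid[left]), mean(resid[!left]))
            sse <- sum((resid - fit)^2)
            if (sse < best$sse) best <- list(sse = sse, fit = fit)
        }
        pred <- pred + shrinkage * best$fit
    }
    pred
}
```

In practice one would tune the number of rounds, the shrinkage, and the tree depth by cross-validation, exactly as was done for mtry and nodesize in the random forest.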